Evaluation as a Goal Surface: Experiments, Learning Boundary, and ETH-Aware A/B
How to design experiments that respect goals, roles, and ethics in SI-Core
Draft v0.1 — Non-normative supplement to SI-Core / SI-NOS / EVAL / Learning Boundary / PoLB docs
Acronym note (art-60):
- PLB is reserved for Pattern-Learning-Bridge (art-60-016).
- This document uses Learning Boundary (LB) to mean the “policy learning boundary” concept.
- Rollout governance (modes/waves/backoff) is written as PoLB (Policy Load Balancer).
This document is non-normative. It describes how to think about and operationalize evaluation in a Structured Intelligence (SI) stack: experiments, A/B tests, shadow runs, and off-policy evaluation — all treated as goal surfaces with ETH and Role/Persona overlays.
Normative behavior lives in the SI-Core / SI-NOS / Jump / ETH / EVAL specs.
0. Conventions used in this draft (non-normative)
This draft follows the portability conventions used in 069/084+ when an artifact might be exported, hashed, or attested (EvalSurfaces, experiment contracts, assignment logs, EvalTrace summaries, approval packets):
- `created_at` is operational time (advisory unless time is attested).
- `as_of` carries markers only (time claim + optional revocation view markers) and SHOULD declare `clock_profile: "si/clock-profile/utc/v1"` when exported.
- `trust` carries digests only (trust anchors + optional revocation view digests). Never mix markers into `trust`.
- `bindings` pins meaning as `{id, digest}` (meaningful identities must not be digest-only).
- Avoid floats in policy-/digest-bound artifacts: prefer scaled integers (`*_bp`, `*_ppm`) and integer micro-/milliseconds.
- If you hash/attest legal/procedural artifacts, declare canonicalization explicitly: `canonicalization: "si/jcs-strict/v1"` and `canonicalization_profile_digest: "sha256:..."`. `digest_rule` strings (when present) are explanatory only; verifiers MUST compute digests using pinned schemas/profiles, not by parsing `digest_rule`.
Numeric conventions used in examples:
- For weights and ratios in `[0, 1]`, export as basis points: `x_bp = round(x * 10000)`.
- For probabilities in `[0, 1]`, export as basis points: `p_bp = round(p * 10000)`.
- For very small probabilities, ppm is acceptable: `p_ppm = round(p * 1_000_000)`.
- If a clause uses a legal threshold with decimals, either:
  - export `value_scaled_int` + explicit `scale`, or
  - keep the raw numeric payload in a referenced artifact and export only its digest (`*_ref` / `*_hash`) plus display-only summaries.
Internal computation may still use floats; the convention here is about exported/hashed representations.
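Where code has to move between internal floats and the exported integer forms, the conversion is mechanical. A minimal sketch, with illustrative helper names (`to_bp`, `to_ppm`) that are not part of any spec:

```python
def to_bp(x: float) -> int:
    """Export a weight, ratio, or probability in [0, 1] as basis points."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("expected a value in [0, 1]")
    return round(x * 10_000)

def to_ppm(p: float) -> int:
    """Export a very small probability in [0, 1] as parts per million."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("expected a probability in [0, 1]")
    return round(p * 1_000_000)
```

Round-tripping exact basis points back to a float (`x_bp / 10000.0`) is acceptable for internal computation; only the exported/hashed representation must be integer.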
1. Why “evaluation as a goal surface”
Most systems treat “evaluation” as something outside the system:
- metrics dashboards bolted onto production,
- A/B platforms glued onto the side,
- offline experiments that no one wires back into governance.
In SI-Core, evaluation is part of the same decision fabric:
- Evaluators see goal surfaces and Jumps, just like decision engines.
- Experiments are just another kind of Jump — with a different goal surface.
- ETH and ID apply just as much to “we’re only experimenting” as they do to “we’re shipping.”
This suggests a design principle:
EVAL is itself a Goal Surface, with its own ETH, ID, and PoLB constraints (within the Learning Boundary).
Concretely:
- We define what we are optimizing (metrics),
- for whom (principals / personas),
- on which agents / roles (what is being evaluated),
- under what safety constraints (ETH, PoLB).
Everything else (A/B tests, shadow runs, off-policy evaluation) is just a different way to explore that evaluation goal surface.
2. What EVAL is in SI-Core
We distinguish:
Metrics — raw measurements (loss, CTR, latency, RBL, SCI, fairness metrics…)
Goal surfaces — multi-objective functions over state/actions (GCS, constraints).
EVAL — the subsystem that:
- defines evaluation goal surfaces,
- designs experiments,
- measures outcomes,
- feeds back into model / Jump / policy updates.
At a high level:
OBS → Jumps / Policies → World Effects
 ▲                           │
 │                           ▼
EVAL ← Metrics ← Logs / Traces / Shadow Runs
The key twist: EVAL is not “outside” this diagram. It is itself goal-native: it runs Jumps, respects ETH, and writes to MEM and ID.
3. Who is being evaluated? (Agents, roles, principals)
Before we talk about “A/B testing”, we must answer three questions:
Who is the subject of evaluation?
- A specific SI agent? (e.g. `si:learning_companion:v2`)
- A specific role? (e.g. `role:teacher_delegate/reading_support`)
- A policy bundle behind a Jump? (e.g. `jump:learning.pick_next_exercise@v2.3.1`)
For whom is this evaluation being done (principal)?
- A learner, a patient, a city resident, a regulator?
From which persona’s vantage point is this reported?
- Learner view, teacher view, ops view, regulator view?
We can encode this as an EvaluationSubject and EvaluationView:
evaluation_subject:
kind: "jump_definition" # or "si_agent" / "role" / "policy_bundle"
id: "jump:learning.pick_next_exercise@v2.3.1"
evaluation_principal:
id: "learner:1234" # whose interests are primary
evaluation_view:
persona_id: "persona:teacher_view"
roles_involved:
- "role:teacher"
- "role:learning_companion"
EVAL should refuse vaguely specified experiments like “see which variant has higher engagement” with no subject and no principal. You must say:
“We are evaluating Jump version v2.3.1 vs v2.2.0 for learners of type X, under teacher/guardian/ops personas.”
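A refusal check of this kind can be a plain validator over the exported spec. A minimal sketch, assuming the `evaluation_subject` / `evaluation_principal` / `evaluation_view` shape shown above (the function name `validate_experiment_spec` is illustrative):

```python
def validate_experiment_spec(spec: dict) -> list[str]:
    """Collect the reasons a vaguely specified experiment should be refused."""
    problems = []
    subject = spec.get("evaluation_subject", {})
    if not subject.get("kind") or not subject.get("id"):
        problems.append("missing evaluation_subject (kind + id)")
    if not spec.get("evaluation_principal", {}).get("id"):
        problems.append("missing evaluation_principal")
    if not spec.get("evaluation_view", {}).get("persona_id"):
        problems.append("missing evaluation_view.persona_id")
    return problems
```

An empty problem list means the experiment at least names its subject, principal, and reporting persona; it says nothing yet about ETH or PoLB compliance.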
4. Evaluation goal surfaces
We model an EvalSurface as a structured goal surface with:
- objectives (metrics to maximize/minimize),
- constraints (ETH, safety, fairness),
- scope (who/where/when),
- subject (what is being varied).
eval_surface:
id: "eval:learning_exercise_selection/v1"
created_at: "2028-04-01T00:00:00Z" # operational time (advisory unless time is attested)
as_of:
time: "2028-04-01T00:00:00Z"
clock_profile: "si/clock-profile/utc/v1"
trust:
trust_anchor_set_id: "si/trust-anchors/example/v1"
trust_anchor_set_digest: "sha256:..."
canonicalization: "si/jcs-strict/v1"
canonicalization_profile_digest: "sha256:..."
bindings:
subject_jump:
id: "jump:learning.pick_next_exercise"
digest: "sha256:..."
subject: "jump:learning.pick_next_exercise"
scope:
domain: "learning"
population: "grade_5_reading_difficulties"
context: "school_hours"
objectives:
primary:
- name: "mastery_gain_7d_bp"
weight_bp: 6000 # 0.60
- name: "wellbeing_score_bp"
weight_bp: 4000 # 0.40
secondary:
- name: "ops_cost_per_session_usd_micros"
weight_bp: -1000 # -0.10 (minimize cost)
constraints:
hard:
- "wellbeing_score_bp >= 7000"
- "no_increase_in_flagged_distress_events == true"
- "no_EthViolation(type='protected_group_harm')"
soft:
- "ops_cost_ratio_bp <= 12000" # <= 1.20 * baseline
roles:
- "role:learning_companion"
- "role:teacher_delegate"
personas:
report_to:
- "persona:teacher_view"
- "persona:ops_view"
This makes evaluation a first-class object:
- EVAL Jumps optimize this eval surface when picking experiment designs.
- ETH overlays can inspect it: “are you treating wellbeing as a primary goal?”
- ID/MEM can trace: “who approved this eval surface?”
5. Experiments as Jumps (E-Jumps)
We treat experiments as a specific class of Jumps:
They are effectful w.r.t. world state (because they actually change assignments).
Their primary “action” is not a business action, but a policy decision:
- choose policy A vs B vs C for a given subject/context,
- decide assignment proportions, durations, PoLB boundaries.
5.1 Experiment Jump request
@dataclass
class ExperimentJumpRequest:
eval_surface: EvalSurface
subject: EvaluationSubject
candidate_policies: list[PolicyVariant] # e.g. Jump versions
population: PopulationDefinition
polb_config: PoLBConfig
eth_overlay: ETHConfig
role_persona: RolePersonaContext
5.2 Experiment Jump draft
@dataclass
class ExperimentJumpDraft:
assignment_scheme: AssignmentScheme
monitoring_plan: MonitoringPlan
stop_rules: StopRuleSet
eval_trace_contract: EvalTraceContract
learning_boundary: LearningBoundarySpec
The core idea:
Experiment design = plan for how to move within the Learning Boundary, while respecting ETH and delivering signal on the eval surface.
6. Learning Boundary (LB) recap
The Learning Boundary (LB) is the structural fence between:
- the “live world” (citizens, learners, patients, infrastructure…), and
- the experimentation / learning machinery.
Very roughly:
World ←→ LB ←→ Policies / Models / SI agents
Key non-normative LB patterns for experiments:
Sandbox / historical (LB)
- world is frozen as logs; we can replay counterfactuals.
- No new risk to principals; ETH is easier.
Shadow (LB)
- new policy runs in parallel (“shadow”), no real effects.
- We compare recommendations to production decisions.
Online (LB)
- new policy actually affects the world.
- Needs ETH-aware constraints, role-aware accountability, and alive stop-rules.
E-Jumps (“Experiment Jumps”) must declare which PoLB mode they operate in (PoLB being the operational interface of the Learning Boundary), and ETH should apply different scrutiny to each mode.
polb_config:
envelope_mode: "online" # sandbox | shadow | online | degraded (see art-60-038)
mode_name: "ONLINE_EXPERIMENTAL_STRATIFIED" # PoLB mode label (see art-60-038)
max_risk_level: "medium" # low | medium | high | emergency
rollout_strategy: "canary" # canary | random | stratified
# Exported form avoids floats: ratios in [0,1] use basis points (bp).
max_population_share_bp: 1000 # 0.10
guarded_by_eth: true
7. ETH-aware A/B and variant experiments
Now, given EvalSurface + PoLB, what does an ETH-aware A/B test look like?
7.1 ETH-aware experiment contract
experiment:
id: "exp:learning_pick_next_exercise_v2_vs_v1"
created_at: "2028-04-10T00:00:00Z" # operational time (advisory unless time is attested)
as_of:
time: "2028-04-10T00:00:00Z"
clock_profile: "si/clock-profile/utc/v1"
trust:
trust_anchor_set_id: "si/trust-anchors/example/v1"
trust_anchor_set_digest: "sha256:..."
canonicalization: "si/jcs-strict/v1"
canonicalization_profile_digest: "sha256:..."
# Pin meaning when exporting/hashing (avoid digest-only meaning).
bindings:
subject_jump:
id: "jump:learning.pick_next_exercise"
digest: "sha256:..."
eval_surface:
id: "eval:learning_exercise_selection/v1"
digest: "sha256:..."
subject: "jump:learning.pick_next_exercise"
eval_surface: "eval:learning_exercise_selection/v1"
polb_config:
envelope_mode: "online" # sandbox | shadow | online | degraded (see art-60-038)
mode_name: "ONLINE_EXPERIMENTAL_STRATIFIED" # can be domain-specific
max_risk_level: "medium" # low | medium | high | emergency
rollout_strategy: "stratified" # canary | random | stratified
max_population_share_bp: 2500 # 0.25
guarded_by_eth: true
variants:
control:
policy: "jump:learning.pick_next_exercise@v1.9.0"
traffic_share_bp: 7500 # 0.75
treatment:
policy: "jump:learning.pick_next_exercise@v2.0.0"
traffic_share_bp: 2500 # 0.25
eth_constraints:
forbid:
- "randomization_by_protected_attribute"
- "higher_exposure_to_risky_content_for_vulnerable_learners"
require:
- "treatment_never_worse_than_control_for_wellbeing_on_avg"
- "continuous_monitoring_of_distress_signals"
- "immediate_abort_on_wellbeing_regression_bp > 500" # 0.05
stop_rules:
min_duration_days: 14
max_duration_days: 60
early_stop_on:
- "clear_superiority (p < 0.01) on mastery_gain_7d_bp"
- "eth_violation_detected"
7.2 ETH + ID at assignment time
Assignment itself is a small Jump:
class VariantAssigner:
def assign(self, principal, context, experiment):
"""
experiment.variants is expected to match the exported contract form:
variants:
control: {policy: "...", traffic_share_bp: 7500}
treatment: {policy: "...", traffic_share_bp: 2500}
"""
# 1) ETH / PoLB gating
if not self.eth_overlay.permits_assignment(principal, context, experiment):
control = experiment.variants["control"]
return control["policy"], "eth_forced_control"
# 2) Build traffic shares from the exported contract (basis points, no floats).
shares_bp = {k: int(v["traffic_share_bp"]) for k, v in experiment.variants.items()}
# 3) Deterministic draw (portable): randomizer SHOULD use a stable digest (e.g. sha256).
variant_id = self.randomizer.draw_bp(
principal_id=principal.id,
experiment_id=experiment.id,
shares_bp=shares_bp,
)
policy = experiment.variants[variant_id]["policy"]
# 4) Log assignment with ID and MEM
self.eval_trace.log_assignment(
principal_id=principal.id,
experiment_id=experiment.id,
variant=variant_id,
role_context=context.role_persona,
)
return policy, "assigned"
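The `draw_bp` randomizer referenced in step 3 can be sketched as a sha256-keyed draw over basis-point shares. Non-normative; the digest construction and `|` field separator below are assumptions, not spec:

```python
import hashlib

def draw_bp(principal_id: str, experiment_id: str, shares_bp: dict[str, int]) -> str:
    """Deterministic, portable variant draw from basis-point traffic shares.

    Hashes (experiment_id, principal_id) with sha256 and maps the digest
    onto [0, 10000), walking shares_bp in insertion order.
    Assumes shares_bp sums to exactly 10_000.
    """
    if sum(shares_bp.values()) != 10_000:
        raise ValueError("traffic shares must sum to 10000 bp")
    digest = hashlib.sha256(f"{experiment_id}|{principal_id}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") % 10_000
    cumulative = 0
    for variant_id, share in shares_bp.items():
        cumulative += share
        if point < cumulative:
            return variant_id
    raise AssertionError("unreachable: shares_bp sums to 10000")
```

Because the draw depends only on stable identifiers, the same principal always lands in the same variant, and an auditor can recompute every assignment from the EvalTrace.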
ETH-aware constraints might:
- forbid certain principals from being randomized (e.g. high-risk medical cases),
- force control for specific groups,
- require stratified randomization to preserve fairness.
8. Shadow evaluation and off-policy evaluation
Not all evaluation must touch the live world.
8.1 Shadow Jumps
A shadow Jump runs a candidate policy in parallel with the real one:
- uses the same OBS bundle, ID, Role/Persona context,
- runs a Jump with RML budget set to NONE (no effectful ops),
- compares decisions + GCS to production behavior.
Contract sketch:
shadow_eval:
id: "shadow:city_flood_policy_v3"
subject: "jump:city.adjust_flood_gates"
polb_config:
envelope_mode: "shadow" # sandbox | shadow | online | degraded (see art-60-038)
mode_name: "SHADOW_PROD" # illustrative PoLB mode label (see art-60-038)
rml_budget: "NONE" # shadow eval must be non-effectful
candidate_policy: "jump:city.adjust_flood_gates@v3.0.0"
baseline_policy: "jump:city.adjust_flood_gates@v2.5.1"
metrics:
- "GCS_delta_safety"
- "GCS_delta_cost"
- "policy_disagreement_rate_bp"
- "ETH_block_rate_bp"
Shadow evaluation is ideal for:
- domains where online A/B is too risky (floodgates, ICU),
- quick sanity checks (does ETH block the new policy constantly?),
- feeding into more precise off-policy estimators.
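One of the shadow metrics above, `policy_disagreement_rate_bp`, is easy to pin down concretely. A sketch, assuming the shadow log has been reduced to (baseline_action, candidate_action) pairs:

```python
def policy_disagreement_rate_bp(shadow_log: list[tuple[str, str]]) -> int:
    """Fraction of decisions where candidate and baseline disagree, in bp.

    shadow_log is assumed to be a list of (baseline_action, candidate_action)
    pairs collected during a non-effectful shadow run.
    """
    if not shadow_log:
        return 0
    disagreements = sum(1 for baseline, candidate in shadow_log if baseline != candidate)
    return round(disagreements * 10_000 / len(shadow_log))
```

A high disagreement rate does not by itself mean the candidate is worse; it means the shadow comparison carries real signal and deserves a closer look (including the `ETH_block_rate_bp` on the disagreeing subset).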
8.2 Off-policy evaluation
Off-policy evaluation (OPE) uses historical logs to estimate how a new policy would have performed, without deploying it.
We won’t specify a particular estimator (IPW, DR, etc.), but we can define a portable interface shape:
class OffPolicyEvaluator:
    def evaluate(self, logs, candidate_policy, eval_surface):
        """Estimate candidate policy performance on eval_surface from logs."""
        estimates = []
        for log in logs:
            context = log.context
            action_taken = log.action
            outcome = log.outcome
            # Importance weight: how likely the candidate is to take the
            # *logged* action, relative to the behavior policy that took it.
            w = self._importance_weight(
                log.behavior_policy_prob,
                candidate_policy.prob(context, action_taken),
            )
            # Value of the logged (action, outcome) pair on the eval surface
            contribution = self._eval_contribution(
                action_taken, outcome, eval_surface
            )
            estimates.append(w * contribution)
        return aggregate_estimates(estimates)
Important invariants:
- OPE must respect ID / MEM: you know which behavior policy generated each log.
- ETH may forbid using some logs for certain kinds of off-policy queries (GDPR, consent).
- OPE is a sandbox (LB) evaluation method — zero new live risk.
9. Role & Persona overlays for EVAL
We now combine EVAL with the Role/Persona overlays of art-60-036.
9.1 Role-aware evaluations
Different roles may be subjects or context for evaluation:
- Evaluate `role:learning_companion` vs `role:teacher_delegate` in a co-teaching scenario.
- Evaluate multi-agent protocols (e.g. a `city_ops` agent + a `flood_model` agent).
We can annotate eval surfaces with explicit roles:
eval_surface:
id: "eval:multi_agent_city_control/v1"
subject:
kind: "multi_agent_protocol"
id: "proto:city_ops+flood_model@v1"
roles_under_test:
- "role:city_operator_ai"
- "role:flood_model_ai"
roles_observing:
- "role:human_city_operator"
EVAL can then:
- attribute metrics to each role (e.g. who proposed what),
- reason about delegation chains: was a bad outcome due to mis-delegation or mis-execution?
9.2 Persona-aware reporting
The same experiment’s outcomes are rendered differently for different personas:
persona_views:
learner_view:
show_metrics:
- "mastery_gain_7d"
- "stress_load"
explanation_style: "simple"
teacher_view:
show_metrics:
- "mastery_gain_7d"
- "curriculum_coverage"
- "risk_flags"
explanation_style: "technical"
regulator_view:
show_metrics:
- "wellbeing_score"
- "fairness_gap_metrics"
- "policy_rollout_pattern"
explanation_style: "regulatory"
EVAL is responsible for:
- projecting metrics into persona views (using MetricProjector),
- adapting explanations (using ExplanationAdapter),
- ensuring that regulators see the structural evaluation setup (EvalSurface, PoLB, ETH), not just outcome metrics.
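The metric-projection half of this can be sketched directly against the `persona_views` shape above. The `MetricProjector` interface itself is only named in this doc, so the function below is an illustrative stand-in:

```python
def project_for_persona(metrics: dict, persona_views: dict, persona: str) -> dict:
    """Project raw experiment metrics into one persona's view.

    persona_views follows the YAML shape in this section:
    each view declares show_metrics and explanation_style.
    """
    view = persona_views[persona]
    return {
        "metrics": {k: metrics[k] for k in view["show_metrics"] if k in metrics},
        "explanation_style": view["explanation_style"],
    }
```

Note this is a projection, not an access-control mechanism: actually withholding metrics from a persona must be enforced by ID/ETH, not by the renderer.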
10. EvalTrace: auditing experiments and interventions
Just as Jumps produce JumpTrace, experiments produce EvalTrace.
eval_trace:
experiment_id: "exp:learning_pick_next_exercise_v2_vs_v1"
subject: "jump:learning.pick_next_exercise"
eval_surface_id: "eval:learning_exercise_selection/v1"
created_at: "2028-04-15T10:00:00Z" # operational time (advisory unless time is attested)
as_of:
time: "2028-04-15T10:00:00Z"
clock_profile: "si/clock-profile/utc/v1"
trust:
trust_anchor_set_id: "si/trust-anchors/example/v1"
trust_anchor_set_digest: "sha256:..."
canonicalization: "si/jcs-strict/v1"
canonicalization_profile_digest: "sha256:..."
bindings:
experiment:
id: "exp:learning_pick_next_exercise_v2_vs_v1"
digest: "sha256:..."
eval_surface:
id: "eval:learning_exercise_selection/v1"
digest: "sha256:..."
assignments:
- principal_id: "learner:1234"
variant: "treatment"
assigned_at: "2028-04-15T10:00:00Z"
role_context: "role:learning_companion"
delegation_chain: ["guardian:777", "teacher:42", "si:learning_companion"]
randomization_seed_digest: "sha256:..." # portable seed identity
reason: "assigned"
outcomes:
window: "7d"
metrics:
treatment:
mastery_gain_7d_bp: 2100
wellbeing_score_bp: 8100
control:
mastery_gain_7d_bp: 1800
wellbeing_score_bp: 8200
ethics:
eth_policy: "eth:learning_v4"
violations_detected: []
fairness_audit:
demographic_breakdown:
group_A: {...}
group_B: {...}
polb:
envelope_mode: "online"
mode_name: "ONLINE_EXPERIMENTAL_STRATIFIED"
canary_phase:
start: "2028-04-10"
end: "2028-04-14"
max_population_share_bp: 500 # 0.05
ramp_up_plan:
steps:
- {date: "2028-04-20", share_bp: 1000}
- {date: "2028-05-01", share_bp: 2500}
EvalTrace is what lets you answer:
- Who designed this experiment?
- On whose behalf?
- With what constraints?
- Did we actually follow our PoLB / ETH / stop-rules?
11. Testing strategies: “evaluate the evaluators”
EVAL itself can go wrong (p-hacking, unsafe randomization, mis-logged assignments). We need tests and governance.
11.1 Unit tests
EvalSurface construction:
- reject missing subject/principal,
- enforce ETH constraints present for high-risk domains.
Variant assignment:
- deterministic under fixed seeds,
- respects traffic shares,
- never randomizes on forbidden attributes.
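The “respects traffic shares” property can be checked exactly, without statistics, by enumerating the basis-point space a deterministic draw maps over. A sketch (the `variant_for_point` helper is illustrative):

```python
def variant_for_point(point: int, shares_bp: dict) -> str:
    """Map a point in [0, 10000) to a variant via cumulative bp shares."""
    cumulative = 0
    for variant, share in shares_bp.items():
        cumulative += share
        if point < cumulative:
            return variant
    raise ValueError("point outside [0, sum(shares_bp))")

def test_exact_share_partition():
    shares = {"control": 7500, "treatment": 2500}
    counts = {"control": 0, "treatment": 0}
    for point in range(10_000):
        counts[variant_for_point(point, shares)] += 1
    # bp shares are honored exactly over the whole point space
    assert counts == shares
```

Determinism under fixed seeds then reduces to the randomizer producing the same point for the same (principal, experiment) pair, which can be asserted separately.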
11.2 Integration tests
- Full pipeline: E-Jump → assignments → metrics → analysis.
- Shadow evaluation: check that RML=NONE truly produces no effects.
- PoLB: verify online experiments never exceed stated population share.
11.3 Property-based tests
Example: no ETH violation due to experiment:
from hypothesis import given
@given(context=gen_contexts(), principal=gen_principals())
def test_assignment_respects_eth(context, principal):
exp = make_test_experiment()
policy, reason = assigner.assign(principal, context, exp)
assert not eth_overlay.is_forbidden_assignment(
principal, context, exp, policy
)
Example: ID consistency — every assignment, outcome, and metric entry can be tied back to the same principal and experiment.
12. Summary: EVAL as a first-class goal surface
In a Structured Intelligence stack, “evaluation” is not an afterthought or an external dashboard. It is:
- a goal surface with explicit objectives and constraints,
- scoped to subjects (agents, roles, Jump definitions),
- defined for principals and rendered through personas,
- executed via E-Jumps inside the Learning Boundary, with PoLB-governed rollout,
- guarded by ETH and Role/Persona overlays,
- fully traceable in EvalTrace under MEM/ID.
This lets you say things like:
“We improved mastery gain by 3 points for learners like X, without harming wellbeing or fairness, via Jump v2.0.0, evaluated in a PoLB-constrained, ETH-approved experiment, and here is the trace that proves it.”
That is what “Evaluation as a Goal Surface” means in the context of SI-Core.
13. Experiment design algorithms
Challenge: An Experiment Jump (E-Jump) should not just say “do an A/B” — it should propose a statistically sound design: sample size, power, stopping rules, and monitoring, all consistent with the EvalSurface and PoLB.
13.1 Sample size calculation
A first layer is a SampleSizeCalculator that uses:
- the primary objective on the EvalSurface,
- a minimum detectable effect (MDE),
- desired power and alpha,
- variance estimates from historical data or simulations.
import numpy as np
from scipy.stats import norm
class SampleSizeCalculator:
def calculate(
self,
eval_surface,
effect_size, # absolute difference you want to detect
power: float = 0.8,
alpha: float = 0.05,
num_variants: int = 2,
):
"""Compute required sample size per variant for the primary metric (normal approx)."""
if effect_size <= 0:
raise ValueError("effect_size must be > 0")
primary_metric = eval_surface.objectives.primary[0]
variance = self._estimate_variance(
primary_metric.name,
eval_surface.scope.population,
)
if variance <= 0:
raise ValueError("variance must be > 0")
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
# Two-arm z-approximation (common planning approximation)
n_per_variant = 2 * variance * ((z_alpha + z_beta) / effect_size) ** 2
return {
"n_per_variant": int(np.ceil(n_per_variant)),
"total_n": int(np.ceil(n_per_variant * num_variants)),
"assumptions": {
"effect_size": effect_size,
"variance": variance,
"power": power,
"alpha": alpha,
"primary_metric": primary_metric.name,
},
}
In SI-Core terms, this calculator is parameterized by the EvalSurface: the same infrastructure can be used for many experiments, but the primary metric, variance sources, and population are goal-surface–specific.
13.2 Power analysis
During or after an experiment, EVAL may want to ask:
- “Given what we’ve seen so far, how much power do we actually have?”
- “If we stop now, what’s the risk of a false negative?”
import numpy as np
from scipy.stats import t, nct
def power_analysis(eval_surface, observed_n, effect_size, alpha=0.05, variance=None):
"""
Approx power for a two-sided two-sample t-test via noncentral t.
If variance is not provided, this expects an external estimator.
"""
if observed_n <= 2:
return {"power": 0.0, "note": "observed_n too small for df"}
if variance is None:
primary_metric = eval_surface.objectives.primary[0].name
variance = estimate_variance(primary_metric, eval_surface.scope.population) # external hook
if variance <= 0 or effect_size == 0:
return {"power": 0.0, "note": "non-positive variance or zero effect_size"}
cohen_d = effect_size / np.sqrt(variance)
df = observed_n - 2
# Two-sample: ncp ~ d * sqrt(n/2)
ncp = cohen_d * np.sqrt(observed_n / 2)
tcrit = t.ppf(1 - alpha / 2, df=df)
# power = P(T > tcrit) + P(T < -tcrit) under noncentral t
power = (1 - nct.cdf(tcrit, df=df, nc=ncp)) + nct.cdf(-tcrit, df=df, nc=ncp)
return {"power": float(power), "df": int(df), "ncp": float(ncp), "tcrit": float(tcrit)}
This can be wired into monitoring and stop-rules: if power is clearly insufficient given the PoLB constraints, the system can recommend “stop for futility” or “extend duration if ETH permits.”
13.3 Early stopping: sequential testing and ETH
E-Jumps should be able to stop early:
- for efficacy (treatment clearly better),
- for futility (no realistic chance of detecting meaningful effects),
- for harm (ETH / safety issues).
class SequentialTestingEngine:
    def __init__(self, alpha=0.05, spending_function="obrien_fleming"):
        self.alpha = alpha
        self.spending_function = spending_function

    def check_stop(self, experiment, current_data, analysis_number, max_analyses):
        """Sequential testing with alpha spending."""
        import numpy as np
        from scipy.stats import norm

        # O'Brien–Fleming-style alpha spending (illustrative):
        # alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t))), t = information fraction,
        # which spends the full alpha exactly at t = 1.
        if self.spending_function == "obrien_fleming":
            z = norm.ppf(1 - self.alpha / 2)
            spent_alpha = 2 * (1 - norm.cdf(
                z / np.sqrt(analysis_number / max_analyses)
            ))
        else:
            spent_alpha = self.alpha / max_analyses  # Bonferroni-style fallback

        test_stat, p_value = self._compute_test_stat(
            current_data, experiment.eval_surface
        )

        # Stop for efficacy
        if p_value < spent_alpha:
            return StopDecision(
                stop=True,
                reason="efficacy",
                effect_estimate=self._estimate_effect(current_data),
                confidence_interval=self._ci(current_data, spent_alpha),
            )

        # Stop for futility
        if self._futility_check(current_data, experiment):
            return StopDecision(stop=True, reason="futility")

        # Stop for harm: ETH-aware check
        if self._harm_check(current_data, experiment.eth_constraints):
            return StopDecision(stop=True, reason="eth_violation_detected")

        return StopDecision(stop=False)
ETH constraints should be able to override purely statistical arguments: if harm is detected, you stop even if power is low or alpha is not yet “spent.”
13.4 E-Jump proposal algorithm
Putting it together, an E-Jump engine can propose a full design:
class ExperimentProposer:
def propose_experiment(self, eval_surface, candidates, polb_config):
"""Propose an E-Jump design consistent with EvalSurface + PoLB."""
# 1. Infer minimum detectable effect from domain norms
effect_size = self._minimum_detectable_effect(eval_surface)
# 2. Calculate required sample size
sample_size = self.sample_size_calc.calculate(
eval_surface,
effect_size,
power=0.8,
alpha=0.05,
num_variants=len(candidates)
)
# 3. Choose rollout strategy based on PoLB and risk
if polb_config.envelope_mode == "online":
if polb_config.max_risk_level in ("high", "emergency"):
rollout = "canary"
initial_share = 0.01
else:
rollout = "stratified"
initial_share = 0.10
else:
rollout = "sandbox" # naming aligned with envelope_mode taxonomy
initial_share = 0.0
assignment = self._design_assignment(
population=eval_surface.scope.population,
candidates=candidates,
sample_size=sample_size,
polb_config=polb_config,
initial_share=initial_share,
)
# 4. Monitoring plan: which metrics, how often, with which alerts
monitoring = MonitoringPlan(
metrics=(
[o.name for o in eval_surface.objectives.primary] +
[c for c in eval_surface.constraints.hard]
),
check_frequency="daily",
alert_thresholds=self._derive_alert_thresholds(eval_surface),
)
# 5. Stop rules (statistical + ETH)
stop_rules = self._design_stop_rules(eval_surface, sample_size)
return ExperimentJumpDraft(
assignment_scheme=assignment,
monitoring_plan=monitoring,
stop_rules=stop_rules,
eval_trace_contract=self._eval_trace_contract(eval_surface),
learning_boundary=polb_config.boundary,
expected_duration=self._estimate_duration(
sample_size, eval_surface.scope.population
),
statistical_guarantees={"power": 0.8, "alpha": 0.05},
)
E-Jumps, in this view, are planners over experiment space: they optimize not just “which variant,” but how to learn about which variant under PoLB/ETH constraints.
14. Multi-objective optimization in evaluation
Challenge: EvalSurfaces are multi-objective: mastery vs wellbeing, cost vs safety, etc. EVAL must reason about trade-offs both when designing experiments and when ranking outcomes.
14.1 Pareto-optimal experiment designs
Different experiment designs can trade off:
- information gain on different metrics,
- risk to participants,
- cost and duration.
class ParetoExperimentOptimizer:
def find_pareto_optimal_experiments(self, eval_surface, candidate_experiments):
"""Find Pareto-optimal experiment designs on multiple criteria."""
evaluations = []
for exp in candidate_experiments:
scores = {}
# Expected information gain per primary objective
for obj in eval_surface.objectives.primary:
scores[obj.name] = self._predict_info_gain(exp, obj)
# Treat risk and cost as additional objectives
scores["risk"] = self._assess_risk(exp, eval_surface)
scores["cost"] = self._estimate_cost(exp)
evaluations.append((exp, scores))
# Compute Pareto frontier
pareto_set = []
for i, (exp_i, scores_i) in enumerate(evaluations):
dominated = False
for j, (exp_j, scores_j) in enumerate(evaluations):
if i == j:
continue
if self._dominates(scores_j, scores_i, eval_surface):
dominated = True
break
if not dominated:
pareto_set.append((exp_i, scores_i))
return pareto_set
Governance can then choose among Pareto-optimal designs based on domain norms (e.g. “always prefer lower risk given similar information gain”).
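The `_dominates` hook used above is standard Pareto dominance. A sketch, assuming the EvalSurface tells us which criteria are maximized (e.g. info gain) and which minimized (risk, cost):

```python
def dominates(scores_a: dict, scores_b: dict, maximize: set, minimize: set) -> bool:
    """True if design A is at least as good as B everywhere and strictly better somewhere."""
    at_least_as_good = all(
        scores_a[k] >= scores_b[k] for k in maximize
    ) and all(
        scores_a[k] <= scores_b[k] for k in minimize
    )
    strictly_better = any(
        scores_a[k] > scores_b[k] for k in maximize
    ) or any(
        scores_a[k] < scores_b[k] for k in minimize
    )
    return at_least_as_good and strictly_better
```

A design survives into the Pareto set exactly when no other candidate dominates it under this predicate.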
14.2 Scalarization with constraint penalties
When a single scalar score is needed (e.g. for bandits or automated ranking), we can scalarize the multi-objective EvalSurface:
def weighted_scalarization(eval_surface, experiment_outcomes):
"""
Convert multi-objective outcomes into a single scalar score.
Portability note:
- Prefer weight_bp in exported EvalSurfaces.
- If only float weights exist (internal sketches), they are accepted.
"""
score = 0.0
# primary/secondary objectives (assumed structured)
for obj in eval_surface.objectives.primary:
w_bp = getattr(obj, "weight_bp", None)
w = (float(w_bp) / 10000.0) if w_bp is not None else float(getattr(obj, "weight", 0.0))
score += w * experiment_outcomes[obj.name]
for obj in eval_surface.objectives.secondary:
w_bp = getattr(obj, "weight_bp", None)
w = (float(w_bp) / 10000.0) if w_bp is not None else float(getattr(obj, "weight", 0.0))
score += w * experiment_outcomes[obj.name]
# hard constraints (assumed strings/expr)
for expr in eval_surface.constraints.hard:
if not check_constraint(expr, experiment_outcomes):
return -1e6 # hard fail
# soft constraints: allow either
# - "expr string"
# - {expr: "...", penalty_weight: 0.1} or {expr: "...", penalty_weight_bp: 1000}
for c in eval_surface.constraints.soft:
if isinstance(c, str):
expr = c
penalty_weight = 1.0
else:
expr = c.get("expr") or c.get("constraint")
if "penalty_weight_bp" in c:
penalty_weight = float(c.get("penalty_weight_bp")) / 10000.0
else:
penalty_weight = float(c.get("penalty_weight", 1.0))
if expr and (not check_constraint(expr, experiment_outcomes)):
score -= penalty_weight
return score
This is non-normative, but illustrates the shape: never silently turn hard constraints into soft preferences.
14.3 Multi-objective bandits
For continuous evaluation and adaptive experiments, we often need a bandit over policies that respects multi-objective EvalSurfaces.
class MultiObjectiveBandit:
"""Thompson sampling over scalarized multi-objective rewards."""
def __init__(self, eval_surface, candidates):
self.eval_surface = eval_surface
self.candidates = candidates
self.posteriors = {
c.id: self._init_posterior() for c in candidates
}
def select_arm(self):
"""Sample from posteriors, then scalarize."""
samples = {}
for cand in self.candidates:
objective_samples = {}
for obj in self.eval_surface.objectives.primary:
objective_samples[obj.name] = (
self.posteriors[cand.id][obj.name].sample()
)
samples[cand.id] = self._scalarize(
objective_samples, self.eval_surface
)
# Candidate with highest sampled scalarized reward
best_id = max(samples, key=samples.get)
return next(c for c in self.candidates if c.id == best_id)
def update(self, cand_id, outcomes):
"""Update per-objective posteriors from observed outcomes."""
for obj in self.eval_surface.objectives.primary:
self.posteriors[cand_id][obj.name].update(outcomes[obj.name])
14.4 Constraint handling strategies
Non-normative but useful patterns:
constraint_handling_strategies:
hard_constraints:
strategy: "Feasibility preservation"
implementation: "Reject candidates violating constraints"
soft_constraints:
strategy: "Penalty method"
implementation: "Add penalty term to scalarized objective"
chance_constraints:
strategy: "Probabilistic satisfaction"
implementation: "Require Pr(constraint satisfied) >= threshold"
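The chance-constraint row can be made concrete with a Monte Carlo check over posterior samples. A sketch (the function name and sample shape are illustrative):

```python
def chance_constraint_satisfied_bp(posterior_samples, constraint, threshold_bp=9500):
    """Check Pr(constraint satisfied) >= threshold from posterior samples.

    posterior_samples: list of sampled outcome dicts.
    constraint: predicate over a single sample.
    threshold_bp: required satisfaction probability in basis points (9500 = 0.95).
    Returns (satisfied, estimated probability in bp).
    """
    if not posterior_samples:
        raise ValueError("need at least one posterior sample")
    satisfied = sum(1 for s in posterior_samples if constraint(s))
    p_bp = round(satisfied * 10_000 / len(posterior_samples))
    return p_bp >= threshold_bp, p_bp
```

In practice the sample count bounds how finely the probability can be resolved, so the threshold should be chosen with the Monte Carlo error in mind.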
15. Continuous evaluation and adaptive experiments
Challenge: Static A/B tests are slow and wasteful when we could be doing continuous learning. But continuous bandits must still respect PoLB and ETH.
15.1 Multi-armed bandit integration
A simple BanditEvaluator wraps a MAB algorithm and plugs into EvalSurface:
class BanditEvaluator:
"""Continuous evaluation via multi-armed bandits."""
def __init__(self, eval_surface, candidates, algorithm="thompson_sampling"):
self.eval_surface = eval_surface
self.candidates = candidates
if algorithm == "thompson_sampling":
self.bandit = ThompsonSamplingBandit(candidates)
elif algorithm == "ucb":
self.bandit = UCBBandit(candidates)
else:
raise ValueError(f"Unknown algorithm: {algorithm}")
def run_episode(self, principal, context):
"""Single evaluation episode for one principal/context."""
candidate = self.bandit.select_arm()
# Execute Jump with selected candidate policy
result = self.execute_jump(principal, context, candidate)
# Measure outcome projected onto EvalSurface
outcome = self.measure_outcome(result, self.eval_surface)
# Update bandit posteriors
self.bandit.update(candidate.id, outcome)
return result
PoLB and ETH decide where bandits are allowed:
- they may be allowed in low-risk learning UX,
- but are forbidden in high-risk city or medical domains, or restricted to shadow mode.
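A deny-by-default gate for bandit episodes can be sketched as below. The domain names and the policy table are illustrative only; in practice the table would come from PoLB/ETH configuration, not code:

```python
# Illustrative policy table: which PoLB envelope modes permit bandit episodes per domain.
ALLOWED_BANDIT_ENVELOPES = {
    "learning_ux": {"online", "shadow", "sandbox"},
    "city_critical": {"shadow", "sandbox"},  # no online bandits
    "medical": {"sandbox"},                  # sandbox/offline only
}

def bandit_envelope_allowed(domain, envelope_mode):
    """Gate check before run_episode(); unknown domains are denied by default."""
    return envelope_mode in ALLOWED_BANDIT_ENVELOPES.get(domain, set())
```

Calling this at the top of `run_episode` keeps the "where bandits are allowed" decision out of the bandit itself.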
15.2 Contextual bandits (personalized evaluation)
For personalization, contextual bandits can choose variants based on features:
class ContextualBanditEvaluator:
"""Personalized evaluation with a contextual bandit."""
def __init__(self, eval_surface, candidates, feature_extractor, priors):
self.eval_surface = eval_surface
self.candidates = candidates
self.feature_extractor = feature_extractor
self.posteriors = priors # e.g., Bayesian linear models per candidate
    def select_candidate(self, context):
        import numpy as np
        features = self.feature_extractor.extract(context)
samples = {}
for cand in self.candidates:
theta_sample = self.posteriors[cand.id].sample()
samples[cand.id] = float(np.dot(theta_sample, features))
best_id = max(samples, key=samples.get)
return next(c for c in self.candidates if c.id == best_id)
def update(self, cand_id, context, outcome):
features = self.feature_extractor.extract(context)
self.posteriors[cand_id].update(features, outcome)
Again, ETH/PoLB overlay must control:
- which features are allowed (no sensitive attributes),
- where context-driven adaptation is permitted.
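Feature gating can be enforced before extraction ever reaches the bandit. A minimal allowlist filter, assuming features arrive as a flat dict (the field names below are made up for illustration):

```python
def filter_features(features, allowlist):
    """Drop any context feature not explicitly allowlisted.

    Returns (allowed_features, denied_names) so the denial set can be
    logged into EvalTrace for audit.
    """
    allowed = {k: v for k, v in features.items() if k in allowlist}
    denied = sorted(set(features) - set(allowlist))
    return allowed, denied
```

An allowlist (rather than a blocklist of sensitive attributes) fails closed: a newly added feature is excluded until someone affirmatively approves it.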
15.3 Adaptive experiment design
An AdaptiveExperimentDesigner can reallocate traffic as evidence accrues:
class AdaptiveExperimentDesigner:
"""Adapt experiment allocations based on accumulated evidence."""
def adapt_traffic_allocation(self, experiment, current_results):
"""Reallocate traffic to better-performing variants."""
posteriors = {}
for variant in experiment.variants:
posteriors[variant.id] = self._compute_posterior(
variant, current_results
)
# Probability each variant is best
prob_best = {}
for variant in experiment.variants:
prob_best[variant.id] = self._prob_best(
posteriors, variant.id
)
# Minimum allocation (e.g. 5%) to keep learning and avoid starvation
new_allocations = {}
for variant in experiment.variants:
new_allocations[variant.id] = max(
0.05,
prob_best[variant.id],
)
# Normalize to sum to 1
total = sum(new_allocations.values())
return {k: v / total for k, v in new_allocations.items()}
Regret bounds (TS, UCB, contextual bandits) can be part of EvalSurface commentary (“we accept at most X regret over T steps for this domain”).
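Such commentary can be backed by an order-of-magnitude check. A sketch using the generic O(&#8730;(K·T·log T)) worst-case shape shared by TS/UCB-style bandits (the constant `c` is problem-dependent; treat the result as a sanity check, not a guarantee):

```python
import math

def worst_case_regret_bound(num_arms, horizon, c=2.0):
    """Illustrative O(sqrt(K * T * log T)) regret bound for TS/UCB-style bandits."""
    return c * math.sqrt(num_arms * horizon * math.log(horizon))

def within_regret_budget(num_arms, horizon, budget):
    """Does the stated regret budget cover the worst-case bound for this run?"""
    return worst_case_regret_bound(num_arms, horizon) <= budget
```

A budget that fails this check suggests either shrinking the horizon, pruning candidates, or moving the experiment to shadow mode.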
16. Causal inference for evaluation
Challenge: Even with experiments, we face confounding, selection bias, and heterogeneous effects. For off-policy evaluation and quasi-experiments, EVAL should expose a causal layer.
16.1 Basic causal effect estimation
A simple IPW-style CausalEvaluator:
class CausalEvaluator:
    """Causal inference utilities for evaluation."""
    def estimate_treatment_effect(self, data, treatment_var, outcome_var, covariates):
        """Estimate a causal effect adjusting for confounders (self-normalized IPW)."""
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        # Propensity score model
        ps_model = LogisticRegression()
        ps_model.fit(data[covariates], data[treatment_var])
        propensity = ps_model.predict_proba(data[covariates])[:, 1]
        treated = (data[treatment_var] == 1).to_numpy()
        weights = np.where(treated, 1 / propensity, 1 / (1 - propensity))
        y = data[outcome_var].to_numpy()
        # Hajek (self-normalized) IPW: weighted means within each arm,
        # not raw means of weighted outcomes (which mis-normalizes the estimate)
        ate = (
            np.sum(weights[treated] * y[treated]) / np.sum(weights[treated]) -
            np.sum(weights[~treated] * y[~treated]) / np.sum(weights[~treated])
        )
        return {
            "ate": ate,
            "std_error": self._bootstrap_se(
                data, treatment_var, outcome_var, weights
            ),
        }
This fits naturally with off-policy evaluation: logs already include behavior policy information (ID/MEM).
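Since each log entry carries the behavior policy's action probability, a self-normalized importance sampling (SNIPS) estimator for a candidate policy's value can be sketched directly over those logs. The entry field names below are assumptions for illustration, not a schema defined in this draft:

```python
def snips_estimate(logs, target_policy):
    """Self-normalized importance sampling (SNIPS) off-policy value estimate.

    logs: iterable of {"context", "action", "propensity", "reward"} entries,
          where "propensity" is the behavior policy's probability of the
          logged action (from ID/MEM).
    target_policy(context, action): the target policy's action probability.
    """
    num, den = 0.0, 0.0
    for entry in logs:
        w = target_policy(entry["context"], entry["action"]) / entry["propensity"]
        num += w * entry["reward"]
        den += w
    if den == 0.0:
        raise ValueError("no overlap between behavior and target policy")
    # Self-normalization trades a small bias for much lower variance than plain IPS
    return num / den
```

Weight clipping and effective-sample-size diagnostics are the usual next steps before trusting such an estimate.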
16.2 Heterogeneous treatment effects (HTE)
EVAL should be able to ask:
- “For whom does this policy help or hurt?”
class HTEEstimator:
"""Estimate conditional average treatment effects (CATE)."""
def estimate_cate(self, data, treatment, outcome, features):
"""Return a function mapping features → CATE estimate."""
from sklearn.ensemble import RandomForestRegressor
treated_data = data[data[treatment] == 1]
control_data = data[data[treatment] == 0]
model_treated = RandomForestRegressor().fit(
treated_data[features], treated_data[outcome]
)
model_control = RandomForestRegressor().fit(
control_data[features], control_data[outcome]
)
def cate(x):
return (
model_treated.predict([x])[0] -
model_control.predict([x])[0]
)
return cate
ETH overlays can then enforce constraints like “no group’s CATE is significantly negative on wellbeing.”
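A minimal sketch of such a group-level audit, reusing the `cate` function returned by `HTEEstimator` (the grouping structure is an assumption; a production check would also require statistical significance, e.g. a confidence interval per group, before flagging):

```python
def audit_group_cate(cate_fn, groups, threshold=0.0):
    """Flag groups whose average CATE falls below a threshold.

    groups: {group_name: [feature_vectors]} — e.g. members sampled per cohort.
    Returns {group_name: avg_cate} for every flagged group.
    """
    flagged = {}
    for name, members in groups.items():
        avg = sum(cate_fn(x) for x in members) / len(members)
        if avg < threshold:
            flagged[name] = avg
    return flagged
```

A non-empty result would trip the corresponding ETH hard constraint and feed the experiment's stop rules.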
16.3 Instrumental variables
For domains where randomization isn’t possible, instrumental variables provide another tool:
def instrumental_variable_estimation(data, instrument, treatment, outcome, controls):
"""Two-stage least squares (2SLS) estimation."""
import statsmodels.api as sm
import pandas as pd
# First stage: treatment ~ instrument + controls
first_stage = sm.OLS(
data[treatment],
sm.add_constant(data[[instrument] + controls])
).fit()
treatment_hat = first_stage.fittedvalues
# Second stage: outcome ~ treatment_hat + controls
regressors = pd.DataFrame(
{"treatment_hat": treatment_hat, **{c: data[c] for c in controls}}
)
second_stage = sm.OLS(
data[outcome],
sm.add_constant(regressors)
    ).fit()
    # Caveat: the second-stage SEs reported here are not the correct 2SLS SEs
    # (they use residuals with respect to treatment_hat, not the actual treatment);
    # prefer a dedicated IV estimator (e.g. linearmodels' IV2SLS) for inference.
    return {
        "effect": second_stage.params["treatment_hat"],
        "se": second_stage.bse["treatment_hat"],
        "first_stage_f": first_stage.fvalue,
    }
The point is not to prescribe a particular causal toolkit, but to make causal thinking a first-class part of EvalSurface design (e.g. fields like causal_assumptions, identification_strategy).
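Such fields might look like the following (non-normative; the field names and enum values are illustrative, not defined by this draft):

```yaml
eval_surface:
  causal:
    identification_strategy: "instrumental_variable"   # or "randomized", "ipw", "difference_in_differences"
    causal_assumptions:
      - "instrument affects outcome only through treatment (exclusion restriction)"
      - "no unmeasured confounding of the instrument"
    sensitivity_analysis: "bootstrap"   # how robustness to assumption violations is reported
```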
17. Performance and scalability of evaluation systems
Challenge: Evaluation infrastructure itself must scale: variant assignment, ETH checks, logging, and metric aggregation must fit tight latency budgets.
17.1 Scaling the assignment service
A ScalableAssignmentService for high QPS:
class ScalableAssignmentService:
"""Low-latency, high-throughput variant assignment."""
def __init__(self):
self.experiment_cache = ExperimentCache() # experiment configs
self.assignment_cache = AssignmentCache() # deterministic assignments
self.async_logger = AsyncLogger()
def assign(self, principal_id, experiment_id, context):
"""Sub-10ms p99 assignment path."""
# 1. Check deterministic assignment cache
cached = self.assignment_cache.get(principal_id, experiment_id)
if cached:
return cached
# 2. Load experiment config (usually from in-memory cache)
experiment = self.experiment_cache.get(experiment_id)
# 3. Fast, stateless assignment (no DB write on critical path)
variant = self._fast_assign(principal_id, experiment)
# 4. Async logging into EvalTrace / MEM
self.async_logger.log_assignment(
principal_id=principal_id,
experiment_id=experiment_id,
variant_id=variant.id,
context=context,
)
return variant
def _fast_assign(self, principal_id, experiment):
"""
Deterministic assignment honoring traffic shares.
Avoid language/runtime-dependent hash() (Python hash is salted per process).
Use a stable digest (sha256) and basis-point buckets for portability.
"""
import hashlib
key = f"{principal_id}:{experiment.id}:{experiment.salt}".encode("utf-8")
digest = hashlib.sha256(key).digest()
# 0..9999 bucket (basis points)
bucket = int.from_bytes(digest[:4], "big") % 10000
cumulative_bp = 0
for variant in experiment.variants:
share_bp = getattr(variant, "traffic_share_bp", None)
if share_bp is None:
# Non-normative fallback (local conversion): prefer explicit *_bp in exported contracts.
share = float(getattr(variant, "traffic_share", 0.0))
share_bp = int(round(share * 10000))
cumulative_bp += int(share_bp)
if bucket < cumulative_bp:
return variant
return experiment.control # fallback
17.2 Streaming metrics aggregation
For large experiments, metrics must be aggregated in streaming fashion:
class StreamingMetricsAggregator:
"""Real-time metrics aggregation with bounded memory."""
def __init__(self):
self.sketches = {} # keyed by (experiment, variant, metric)
def _key(self, experiment_id, variant_id, metric_name):
return (experiment_id, variant_id, metric_name)
def update(self, experiment_id, variant_id, metric_name, value):
key = self._key(experiment_id, variant_id, metric_name)
if key not in self.sketches:
self.sketches[key] = self._init_sketch(metric_name)
# e.g. t-digest or similar
self.sketches[key].update(value)
def query(self, experiment_id, variant_id, metric_name, quantile=0.5):
key = self._key(experiment_id, variant_id, metric_name)
return self.sketches[key].quantile(quantile)
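The `_init_sketch` helper is left abstract above; a t-digest would be the typical choice for quantiles. For means and variances, one concrete bounded-memory aggregate with the same `update`/query shape is Welford's algorithm (sketch, not tied to any particular library):

```python
class StreamingMeanVar:
    """Welford's online algorithm: bounded-memory mean/variance aggregate."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean
    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
    def variance(self):
        """Sample variance; numerically stable even for long streams."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

Both this and quantile sketches merge cleanly across shards, which is what makes per-(experiment, variant, metric) keying viable at scale.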
17.3 Performance budgets
Non-normative but useful target budgets:
latency_budgets_p99:
assignment_service: 10ms
eth_check: 5ms
metrics_logging: 2ms # async
throughput_targets:
assignments_per_second: 100000
metrics_updates_per_second: 1000000
These should be treated as part of the EvalSurface for ops/persona: evaluation that is too slow can itself violate goals (latency, cost).
18. Experiment governance and approval processes
Challenge: Experiments — especially online and safety-critical ones — need structured governance, not ad-hoc “ship and watch the dashboards.”
18.1 Approval workflow
A non-normative experiment approval workflow:
experiment_approval_workflow:
stage_1_proposal:
required:
- eval_surface
- polb_config
- eth_constraints
- sample_size_justification
submitted_by: "experiment_designer"
stage_2_risk_assessment:
assessor: "domain_expert + ethics_board_representative"
criteria:
- "PoLB mode appropriate for domain risk"
- "Hard constraints cover key safety concerns"
- "Max population share within policy limits"
- "Stop rules adequate for ETH + statistics"
outputs:
- "risk_level: low|medium|high"
- "required_reviewers"
stage_3_review:
low_risk:
reviewers: ["technical_lead"]
turnaround: "2 business days"
medium_risk:
reviewers: ["technical_lead", "domain_expert"]
turnaround: "5 business days"
high_risk:
reviewers: ["technical_lead", "domain_expert", "ethics_board", "legal"]
turnaround: "10 business days"
requires: "formal_ethics_committee_approval"
stage_4_monitoring:
automated:
- "stop_rule_checks"
- "eth_violation_detection"
human:
- "weekly_review_for_high_risk"
- "monthly_review_for_medium_risk"
18.2 Risk rubric
A simple risk assessor for experiments:
class ExperimentRiskAssessor:
def assess_risk(self, experiment):
"""Coarse risk score for experiment governance (portable, bp-friendly)."""
score = 0
envelope_mode = getattr(experiment.polb_config, "envelope_mode", None)
# PoLB envelope: online > shadow > sandbox/offline
if envelope_mode == "online":
score += 3
elif envelope_mode == "shadow":
score += 1
elif envelope_mode == "sandbox":
score += 0
# Domain risk
domain = getattr(getattr(experiment, "subject", None), "domain", None)
if domain in ["medical", "city_critical", "finance"]:
score += 3
# Population share (export-friendly): basis points
max_share_bp = getattr(experiment.polb_config, "max_population_share_bp", None)
if max_share_bp is None:
# Missing explicit bp fields is itself a governance risk (be conservative).
score += 2
else:
            if int(max_share_bp) > 2500:  # > 25% of population
score += 2
# Hard constraints present?
hard = getattr(getattr(experiment, "eth_constraints", None), "hard", None)
if not hard:
score += 2
# Vulnerable populations
if self._involves_vulnerable_pop(experiment.population):
score += 3
if score <= 3:
return "low"
if score <= 7:
return "medium"
return "high"
Risk level can automatically drive:
- required reviewers,
- PoLB modes allowed,
- extra logging / monitoring requirements.
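That driving logic can be a plain lookup. The reviewer sets below mirror the workflow in 18.1; the envelope restrictions per risk level are an illustrative assumption:

```python
# Reviewer sets follow the stage_3_review table in 18.1; envelope limits are illustrative.
GOVERNANCE_BY_RISK = {
    "low": {
        "reviewers": ["technical_lead"],
        "allowed_envelopes": {"online", "shadow", "sandbox"},
    },
    "medium": {
        "reviewers": ["technical_lead", "domain_expert"],
        "allowed_envelopes": {"shadow", "sandbox"},
    },
    "high": {
        "reviewers": ["technical_lead", "domain_expert", "ethics_board", "legal"],
        "allowed_envelopes": {"sandbox"},
    },
}

def governance_for(risk_level):
    """Map an assessed risk level to required reviewers and permitted PoLB envelopes."""
    return GOVERNANCE_BY_RISK[risk_level]
```

Keeping this mapping declarative makes it exportable alongside the EvalSurface, so approval packets can pin exactly which governance applied.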
18.3 Ethics committee integration
For high-risk experiments, you typically require ethics committee involvement:
ethics_committee_review:
triggered_by:
- "risk_level == high"
- "involves_vulnerable_populations"
- "novel_experimental_design"
review_packet:
- "Experiment proposal (EvalSurface + PoLB)"
- "Risk assessment report"
- "Informed consent procedures (if applicable)"
- "Data handling and retention plan"
- "Monitoring and stop rules"
- "Post-experiment analysis and debrief plan"
committee_decision:
- "Approved as proposed"
- "Approved with modifications"
- "Deferred pending additional information"
- "Rejected"
Because EvalSurfaces and E-Jumps are structurally defined, the governance layer can reason over them directly — rather than reading informal design docs — and MEM/ID can keep a durable record of exactly what was approved.