# Dynamical-30B-A3B
A trained judgment layer for autonomous scientific workflows. Starting from Qwen/Qwen3-30B-A3B-Instruct-2507, this model was trained with multi-turn reinforcement learning in verified campaign environments to improve sequential decision-making under uncertainty at the planner-verifier boundary.
The target capability is not general reasoning or autonomous science. It is the decision-making core that determines whether a larger scientific system behaves intelligently when search, evidence, cost, and belief updates are all coupled: selecting which candidate to investigate, evaluating whether evidence should be trusted or escalated, and revising hypotheses as conflicting results accumulate.
## Release Links
- Paper PDF: Training Scientific Judgment with Verified Environments for Autonomous Science
- Blog post: Training Scientific Judgment
- Public repo: Dynamical-Systems-Research/training-scientific-judgment
- Released evaluation bundle: repo `data/open_world/`
- Search assets: `Dynamical-Systems/crystalite-base`, `Dynamical-Systems/crystalite-balanced`
This model is the released scientific-judgment policy used in the final paper and blog post. The associated Crystalite checkpoints are released as supporting search-side assets for the open-world campaign provenance. The default public reproducibility path uses the frozen serialized campaign bundle from the public repo.
## Training
**Base model:** Qwen3-30B-A3B-Instruct-2507 (30B total / 3B active MoE)

**Stage 1 -- Supervised fine-tuning.** LoRA rank 32 on 3,861 rollout rows from GPT-5.4 teacher demonstrations across 128 closed-world environments. SFT establishes the action contract and baseline scientific behavior.

**Stage 2 -- Trajectory GRPO on open-world environments.** 247 training environments (741 entries across 3 seeds), 100 gradient steps with G=4. The RL stage teaches experiential decision-making: when to search vs. exploit existing candidates, when to trust vs. escalate evidence, and when to revise beliefs.
The curriculum uses a staged bucket schedule over environment difficulty:
- Steps 0--9: anchor-heavy warmup (75% anchor, 20% challenge)
- Steps 10--24: broadening (40% anchor, 30% challenge, 25% stress-discriminative)
- Steps 25--100: hard transfer regime (25% anchor, 30% challenge, 35% stress-discriminative, 10% stress-search-fragile)
Budget tiers expand across stages from (5, 8) to the full range (5, 8, 11, 14, 18).
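The staged bucket schedule can be sketched as a per-step sampling rule. This is a minimal illustration, not the training code: the bucket names follow the list above, and the residual probability mass in the first two stages (the percentages sum to 95%) is assumed here to fall on a generic `other` bucket.

```python
import random

# Staged curriculum schedule from the model card; "other" absorbs the
# unlisted 5% mass in the first two stages (an assumption, not stated).
SCHEDULE = [
    (range(0, 10),   {"anchor": 0.75, "challenge": 0.20, "other": 0.05}),
    (range(10, 25),  {"anchor": 0.40, "challenge": 0.30,
                      "stress_discriminative": 0.25, "other": 0.05}),
    (range(25, 101), {"anchor": 0.25, "challenge": 0.30,
                      "stress_discriminative": 0.35,
                      "stress_search_fragile": 0.10}),
]

def sample_bucket(step: int, rng: random.Random) -> str:
    """Pick a difficulty bucket for one gradient step per the staged schedule."""
    for steps, weights in SCHEDULE:
        if step in steps:
            names, probs = zip(*weights.items())
            return rng.choices(names, weights=probs, k=1)[0]
    raise ValueError(f"step {step} is outside the 0-100 schedule")

rng = random.Random(0)
print(sample_bucket(5, rng), sample_bucket(60, rng))
```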
Reward: Hybrid oracle + rubric. A physics-grounded oracle (thermodynamic stability from the MP2020 convex hull) provides manipulation-proof outcome reward. Rubric-based process supervision scores search quality, validation discipline, escalation efficiency, and revision quality via Gemini 3.1 Pro, gated on correctness.
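One way to read "gated on correctness" is that rubric-based process credit is only paid out when the oracle outcome is correct, so process reward can never override a wrong answer. The sketch below illustrates that gating; the function name, mixing weight, and exact combination rule are assumptions, not the paper's implementation.

```python
# Illustrative hybrid reward: oracle outcome plus correctness-gated
# rubric process score. Weighting and gating rule are assumptions.
def hybrid_reward(oracle_correct: bool, rubric_score: float,
                  rubric_weight: float = 0.5) -> float:
    """Combine a manipulation-proof outcome reward with a rubric process
    score that is zeroed out unless the final hypothesis is correct."""
    outcome = 1.0 if oracle_correct else 0.0
    process = rubric_score if oracle_correct else 0.0  # gate on correctness
    return (1 - rubric_weight) * outcome + rubric_weight * process

print(hybrid_reward(True, 0.8), hybrid_reward(False, 0.8))
```

The gate means a trajectory cannot farm rubric reward (search quality, validation discipline) while landing on a wrong hypothesis, which is the usual motivation for pairing process supervision with a verifiable outcome signal.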
This release corresponds to the step-100 merged checkpoint.
## Results
Primary metric: hypothesis accuracy -- the fraction of episodes where the model's highest-posterior hypothesis matches the oracle ground truth after all evidence rounds.
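The metric reduces to an argmax-match over episode posteriors. A minimal sketch, assuming each finished episode exposes a posterior over hypotheses and an oracle label (field names are illustrative):

```python
# Hypothesis accuracy: fraction of episodes whose argmax-posterior
# hypothesis matches the oracle label. Field names are assumptions.
def hypothesis_accuracy(episodes) -> float:
    correct = sum(
        1 for ep in episodes
        if max(ep["posterior"], key=ep["posterior"].get) == ep["oracle"]
    )
    return correct / len(episodes)

episodes = [
    {"posterior": {"A": 0.7, "B": 0.3}, "oracle": "A"},  # correct
    {"posterior": {"A": 0.4, "B": 0.6}, "oracle": "A"},  # wrong
]
print(hypothesis_accuracy(episodes))  # 0.5
```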
The final public release is paired with a frozen open-world bundle containing 300 serialized campaigns overall, with the primary paper evaluation reported on the pruned reachable held-out set of 29 campaigns.
### Held-out learning curve (29 open-world environments, pass@1)
| Checkpoint | Hypothesis Accuracy | Mean Reward | Parse Rate |
|---|---|---|---|
| Step 0 (SFT baseline) | 55.17% (16/29) | 0.3404 | 0.978 |
| Step 25 | 65.52% (19/29) | 0.4076 | 0.982 |
| Step 50 | 68.97% (20/29) | 0.4460 | 0.993 |
| Step 75 | 79.31% (23/29) | 0.4801 | 0.981 |
| Step 100 | 79.31% (23/29) | 0.5001 | 0.983 |
Total gain from SFT baseline: +24.14 percentage points across 100 RL steps. Step 100 matches step 75 in accuracy but achieves +4.2% higher reward and 33% lower truncation, indicating more efficient trajectories at the same correctness level.
### Per-round accuracy (survival curve)
RL fundamentally changes how accuracy relates to investigation depth. The SFT model's accuracy is flat-to-declining across campaign rounds (0.55 at round 0, dropping to 0.33 by round 10+). The RL model's accuracy increases monotonically with depth: 0.79 at round 0, rising to 1.0 by round 7. Episodes that investigate deeper are the ones that converge to the correct answer.
## What RL Teaches
Behavioral analysis of the trained policy reveals three learned capabilities absent in the SFT baseline.
**Search-then-exploit.** The RL model uses `skip_search` in 76.5% of search turns, learning to search once or twice to populate the candidate pool, then exploit existing candidates rather than wasting budget on redundant search. High-reward trajectories show strategic hypothesis switching -- searching for one system, pivoting to another, then returning.

**Suspect-then-decide validation.** The clearest RL signature is learned evidence escalation. Rather than blanket-rejecting all evidence (the SFT default), the RL model develops a two-step pattern: flag evidence as suspect to gather additional signal, then commit to trust or reject. Among the highest-reward trajectories, 100% use this pattern; among the lowest-reward, 0% do.

**Revision as a failure signal, not a tool.** In the SFT model, belief revision correlates with incorrect outcomes (wrong episodes average 2.1 revise turns vs 0.4 for correct). After RL, the 10 episodes that flipped from wrong to correct reduced their revision turns from 1.8 to 0.6. RL eliminates unnecessary revision by improving upstream search and validation, so the model arrives at the correct hypothesis without needing to update beliefs.
## Intended Use
This model is intended for research use in:
- Autonomous science agents requiring structured decision-making
- Budget-constrained experimental planning and evidence triage
- Multi-step scientific judgment under uncertainty
- Studies of post-training for sequential decision-making
It operates as a judgment policy within a larger workflow that provides candidate generation, structured evidence, and domain-specific verification oracles.
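As a concrete illustration of that division of labor, a host workflow would parse the policy's structured action and route it to the matching stage. This is a hypothetical sketch: the action vocabulary mirrors the behaviors described above (`search`/`skip_search`, `trust`/`suspect`/`reject`/`escalate`, `revise`), but the exact schema and field names are assumptions, not the released action contract.

```python
import json

# Hypothetical dispatcher: routes a structured policy action to the
# workflow stage that handles it. Schema and names are illustrative.
def route_action(action: dict) -> str:
    kind = action["action"]
    if kind in {"search", "skip_search"}:
        return "search-stage"          # candidate generation / pool reuse
    if kind in {"trust", "suspect", "reject", "escalate"}:
        return "validation-stage"      # evidence triage and escalation
    if kind == "revise":
        return "belief-update-stage"   # hypothesis revision
    raise ValueError(f"unknown action: {kind}")

example = json.loads('{"action": "suspect", "candidate": "LiFePO4"}')
print(route_action(example))  # validation-stage
```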
## Limitations
- Single training seed. Results are from one training run. The learning curve shape (acceleration at steps 50--75) has not been validated across seeds.
- Evaluation scale. The primary evaluation uses 29 held-out environments with pass@1. Per-budget and per-difficulty analyses have limited statistical power.
- Staged verifier. Cheap and medium evidence stages are transforms of E_hull, not independent physical measurements.
- Persistent failures. 6 of 29 held-out episodes remain incorrect. Failure modes include budget exhaustion on sparse search pools, RL-induced regressions on easy specifications, and contaminated evidence that passes the suspect filter.
- Not a substitute for expert review. This model should not replace physical simulation, laboratory verification, or domain expertise in high-stakes scientific decisions.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Dynamical-Systems/Dynamical-30B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Given these candidates and staged measurements, which one should we validate next, and should we trust or escalate the current evidence?"
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
## Citation
```bibtex
@misc{barnes2026trainingscientificjudgment,
  title={Training Scientific Judgment with Verified Environments for Autonomous Science},
  author={Jarrod Barnes},
  year={2026},
  note={Technical report preprint}
}
```
## Acknowledgments
Dynamical-30B-A3B is derived from Qwen/Qwen3-30B-A3B-Instruct-2507 and retains the upstream Apache 2.0 license. Training infrastructure provided by Tinker.