Phase 3: Stability Testing & Ranking

The final phase runs a tiered tournament: each tier keeps fewer candidates and evaluates them on more seeds.

Tiered Tournament

Default tiers: [(100, 10), (50, 20), (10, 100)], where each pair is (configs evaluated, cumulative seeds per config).

  Tier 1: top 100 configs × 10 seeds → rank → keep top 50
  Tier 2: top 50 × +10 new seeds (20 total) → rank → keep top 10
  Tier 3: top 10 × +80 new seeds (100 total) → final ranking

Each tier runs only the seeds that have not been evaluated yet and reuses all results from earlier tiers. Total runs: 100×10 + 50×10 + 10×80 = 2,300, versus 100×100 = 10,000 if every initial config were run on all 100 seeds.
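The run count above can be checked with a few lines of Python. This is only an illustrative sketch: `incremental_runs` and `naive_runs` are hypothetical helper names, not part of the `calibration` package, and the tiers are assumed to be (configs, cumulative seeds) pairs as in the defaults.

```python
def incremental_runs(tiers):
    """Count runs when each tier evaluates only seeds that were not run before.

    `tiers` is a list of (n_configs, total_seeds) pairs.
    """
    total = 0
    seeds_done = 0
    for n_configs, total_seeds in tiers:
        new_seeds = total_seeds - seeds_done  # seeds added in this tier
        total += n_configs * new_seeds
        seeds_done = total_seeds
    return total


def naive_runs(tiers):
    """Runs needed if every initial config were run on the final seed count."""
    return tiers[0][0] * tiers[-1][1]


default_tiers = [(100, 10), (50, 20), (10, 100)]
print(incremental_runs(default_tiers))  # 2300
print(naive_runs(default_tiers))        # 10000
```

With the default tiers, the incremental scheme needs 23% of the naive run count.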

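from calibration import run_tiered_stability

# `results` is assumed to be the ranked output of the previous phase;
# here only the 20 best candidates enter the tournament.
stable = run_tiered_stability(
    candidates=results[:20],
    scenario="baseline",
    n_workers=10,
)

# Print mean ± standard deviation across seeds for the five best configs.
for r in stable[:5]:
    print(f"Score: {r.mean_score:.4f} ± {r.std_score:.4f}")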
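For readers who want to see the tournament logic itself, here is a minimal sketch under stated assumptions. It is not the implementation of `run_tiered_stability`: `tiered_tournament` and `toy_evaluate` are hypothetical names, candidates are assumed to arrive pre-ranked, higher scores are assumed to be better, and each tier is ranked by plain mean score.

```python
import random
import statistics


def tiered_tournament(configs, evaluate, tiers):
    """Narrow `configs` tier by tier, running only new seeds in each tier.

    configs:  pre-ranked list of hashable config identifiers
    evaluate: callable (config, seed) -> score, higher is better (assumed)
    tiers:    list of (n_configs, total_seeds) pairs
    """
    scores = {c: [] for c in configs}  # accumulated scores per config
    survivors = list(configs)
    seeds_done = 0
    for n_configs, total_seeds in tiers:
        survivors = survivors[:n_configs]  # keep the current top n
        for c in survivors:
            for seed in range(seeds_done, total_seeds):  # unseen seeds only
                scores[c].append(evaluate(c, seed))
        seeds_done = total_seeds
        # Re-rank on every score collected so far, best first.
        survivors.sort(key=lambda c: statistics.mean(scores[c]), reverse=True)
    return [
        (c, statistics.mean(scores[c]), statistics.pstdev(scores[c]))
        for c in survivors
    ]


def toy_evaluate(config, seed):
    """Made-up noisy score: quality grows with the config index."""
    rng = random.Random(config * 1000 + seed)
    return config / 100 + rng.gauss(0, 0.05)


ranking = tiered_tournament(
    list(range(100)), toy_evaluate, [(100, 10), (50, 20), (10, 100)]
)
for config, mean, std in ranking[:3]:
    print(f"config {config}: {mean:.4f} ± {std:.4f}")
```

Because scores accumulate in one dictionary, a config that survives to the last tier ends up with all 100 seed results without any seed being run twice.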

Ranking Strategies

| Strategy | Formula | Best For |
| --- | --- | --- |
| `combined` | `mean_score × (1 - k × std_score)` | Balance of quality and stability |
| `stability` | `pass_rate DESC, n_fail ASC, combined DESC` | Maximizing reproducibility |
| `mean` | `mean_score DESC` | Ignoring variance, best average |

The --k-factor parameter (default 1.0) sets the variance penalty k in the combined score. A higher k penalizes high-variance configs more heavily.

python -m calibration --phase stability --rank-by stability --k-factor 1.5
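The three strategies can be written as sort keys. The sketch below is an illustration, not the package's code: `StabilityResult`, `combined_score`, and `rank` are hypothetical names, the field names are taken from the formulas in the table, and the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class StabilityResult:
    # Hypothetical record; field names follow the formulas in the table.
    name: str
    mean_score: float
    std_score: float
    pass_rate: float
    n_fail: int


def combined_score(r, k=1.0):
    """mean_score × (1 - k × std_score)."""
    return r.mean_score * (1 - k * r.std_score)


def rank(results, strategy="combined", k=1.0):
    """Return results sorted best-first under the given strategy."""
    if strategy == "combined":
        key = lambda r: -combined_score(r, k)
    elif strategy == "stability":
        # pass_rate DESC, n_fail ASC, combined DESC
        key = lambda r: (-r.pass_rate, r.n_fail, -combined_score(r, k))
    elif strategy == "mean":
        key = lambda r: -r.mean_score
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return sorted(results, key=key)


results = [
    StabilityResult("a", 0.90, 0.20, 1.00, 0),
    StabilityResult("b", 0.85, 0.05, 0.98, 2),
    StabilityResult("c", 0.88, 0.10, 1.00, 0),
]
for strategy in ("combined", "stability", "mean"):
    print(strategy, [r.name for r in rank(results, strategy)])
# combined  ['b', 'c', 'a']
# stability ['c', 'a', 'b']
# mean      ['a', 'c', 'b']
```

In this made-up data each strategy picks a different winner: `a` has the best mean but the highest variance, `b` is the steadiest but has failures, and `c` sits between them.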

Multi-Seed Guidance

Single-seed screening keeps the computational cost manageable, but it overfits to one specific random draw: a config that ranks 1st with seed=0 may rank 50th when averaged across 100 seeds. Recommendations:

- Tier 1: At least 10 seeds for initial screening
- Tier 2: 20–30 seeds for medium confidence
- Final selection: 100+ seeds for publication-quality results
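A small simulation shows why a single-seed winner is unreliable. Everything here is assumed for illustration: 50 configs whose true quality differs only slightly, Gaussian seed-to-seed noise with standard deviation 0.05, and the hypothetical helpers `simulate_scores` and `ranking`.

```python
import random
import statistics


def simulate_scores(n_configs=50, n_seeds=100, noise=0.05, rng_seed=0):
    """Return one list of per-seed scores for each config (synthetic data)."""
    rng = random.Random(rng_seed)
    # True quality differs by only 0.001 between neighbouring configs.
    true_quality = [0.80 + 0.001 * i for i in range(n_configs)]
    return [
        [q + rng.gauss(0, noise) for _ in range(n_seeds)]
        for q in true_quality
    ]


def ranking(scores, n_seeds):
    """Config indices sorted best-first by mean over the first n_seeds seeds."""
    means = [statistics.mean(s[:n_seeds]) for s in scores]
    return sorted(range(len(scores)), key=lambda i: means[i], reverse=True)


scores = simulate_scores()
winner_1 = ranking(scores, 1)[0]  # best config when judged on one seed
for n in (1, 10, 20, 100):
    position = ranking(scores, n).index(winner_1) + 1
    print(f"{n:>3} seeds: single-seed winner ranks {position} of {len(scores)}")
```

When the noise is large relative to the gaps between configs, the position of the single-seed winner typically shifts as more seeds are averaged, which is the effect the tiers are designed to absorb.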

Multi-Pass Workflow

For thorough calibration:

1. Run sensitivity → grid → stability for the primary parameters
2. Fix the newly calibrated parameters at their optimal values
3. Re-run the pipeline for the parameters that were held fixed in the first pass
4. Compare results and iterate as needed
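The workflow above amounts to calibrating one group of parameters at a time while holding the rest fixed. The sketch below illustrates that loop on a toy problem. It makes several assumptions: `calibrate` is a hypothetical stand-in for a full sensitivity → grid → stability pass (here a plain grid search), and `toy_objective`, the parameter names, and the starting values are made up.

```python
from itertools import product


def calibrate(objective, grid, fixed):
    """Grid-search the parameters in `grid` while holding `fixed` constant."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = {**fixed, **dict(zip(names, values))}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(p):
    """Made-up score; `a` interacts with `c`, so one pass is not enough."""
    return -((p["a"] - p["c"]) ** 2 + (p["b"] - 0.7) ** 2 + (p["c"] - 0.5) ** 2)


grid_values = [i / 10 for i in range(11)]
params = {"a": 0.0, "b": 0.0, "c": 0.1}  # made-up starting values
groups = [("a", "b"), ("c",)]            # primary parameters first
history = []
for sweep in range(2):
    for group in groups:
        grid = {name: grid_values for name in group}
        fixed = {k: v for k, v in params.items() if k not in group}
        params, score = calibrate(toy_objective, grid, fixed)
        history.append(score)
        print(f"sweep {sweep + 1}, calibrated {group}: score {score:.3f}")
```

Comparing the entries of `history` is the "compare results" step: as long as the score keeps improving between passes, another iteration is worthwhile.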