Phase 3: Stability Testing & Ranking

The final phase runs a tiered tournament: each tier keeps fewer candidates and evaluates them on more seeds.

Tiered Tournament

Default tiers, as (number of configs, total seeds) pairs: [(100, 10), (50, 20), (10, 100)]

  Tier 1: top 100 configs × 10 seeds → rank → keep top 50
  Tier 2: top 50 × +10 new seeds (20 total) → rank → keep top 10
  Tier 3: top 10 × +80 new seeds (100 total) → final ranking

Each tier runs only the new seeds and reuses the results accumulated in earlier tiers. Total cost: 100×10 + 50×10 + 10×80 = 2,300 runs, versus 100×100 = 10,000 for running all 100 configs on all 100 seeds.
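The incremental scheme can be sketched as follows. This is a minimal illustration, not the implementation behind run_tiered_stability: the function name tiered_tournament, the evaluate(config, seed) callable, and the choice of seeds 0, 1, 2, ... are assumptions made for the example, and candidates are assumed to arrive already ordered best-first from the previous phase.

```python
from statistics import mean

DEFAULT_TIERS = ((100, 10), (50, 20), (10, 100))  # (configs, total seeds)


def tiered_tournament(configs, evaluate, tiers=DEFAULT_TIERS):
    """Rank configs through tiers, running only the seeds not yet evaluated.

    evaluate(config, seed) -> float is a hypothetical scoring callable.
    Returns (list of (config, per-seed scores) best-first, number of runs).
    """
    ranked = list(range(len(configs)))   # indices, assumed pre-ranked best-first
    scores = {i: [] for i in ranked}     # accumulated per-seed scores
    n_runs = 0
    for n_configs, total_seeds in tiers:
        ranked = ranked[:n_configs]      # keep the top of the previous ranking
        for i in ranked:
            # Only seeds beyond those already stored are new work.
            for seed in range(len(scores[i]), total_seeds):
                scores[i].append(evaluate(configs[i], seed))
                n_runs += 1
        ranked.sort(key=lambda i: mean(scores[i]), reverse=True)
    return [(configs[i], scores[i]) for i in ranked], n_runs


# Toy check: 100 integer "configs" scored by a deterministic function.
configs = list(range(100))
final, n_runs = tiered_tournament(configs, lambda c, seed: c + 0.001 * seed)
print(n_runs, len(final))  # prints: 2300 10
```

The run counter reproduces the 2,300 total because each surviving config keeps its earlier scores and is only charged for seeds it has not seen.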

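from calibration import run_tiered_stability

# `results` is assumed to be the ranked candidate list from the earlier phases.
stable = run_tiered_stability(
    candidates=results[:20],  # only the top 20 candidates enter the tournament
    scenario="baseline",
    n_workers=10,
)

# Report the top five configs as mean ± standard deviation across seeds.
for r in stable[:5]:
    print(f"Score: {r.mean_score:.4f} ± {r.std_score:.4f}")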

Ranking Strategies

Strategy     Formula                                      Best For
combined     mean_score × (1 - k × std_score)             Balance of quality and stability
stability    pass_rate DESC, n_fail ASC, combined DESC    Maximizing reproducibility
mean         mean_score DESC                              Ignoring variance, best average

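The --k-factor parameter (default 1.0) controls the variance penalty in the combined score. A higher k penalizes high-variance configs more heavily. Because the stability strategy uses the combined score as its final tiebreaker, k also affects that ranking.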

python -m calibration --phase stability --rank-by stability --k-factor 1.5
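The three strategies can be expressed as sort keys. The sketch below is illustrative only: the StabilityResult class and the rank function are hypothetical names, while the field names (mean_score, std_score, pass_rate, n_fail) follow the formulas listed for each strategy.

```python
from dataclasses import dataclass


@dataclass
class StabilityResult:  # hypothetical container for per-config statistics
    name: str
    mean_score: float
    std_score: float
    pass_rate: float
    n_fail: int


def combined_score(r, k=1.0):
    # mean_score × (1 - k × std_score): larger k punishes variance harder.
    return r.mean_score * (1 - k * r.std_score)


def rank(results, rank_by="combined", k=1.0):
    """Return results best-first under the chosen strategy."""
    if rank_by == "combined":
        key = lambda r: -combined_score(r, k)
    elif rank_by == "stability":
        # pass_rate DESC, n_fail ASC, combined DESC
        key = lambda r: (-r.pass_rate, r.n_fail, -combined_score(r, k))
    elif rank_by == "mean":
        key = lambda r: -r.mean_score
    else:
        raise ValueError(f"unknown strategy: {rank_by}")
    return sorted(results, key=key)


# Made-up numbers: "a" has the higher mean, "b" is steadier.
a = StabilityResult("a", mean_score=0.90, std_score=0.20, pass_rate=0.9, n_fail=1)
b = StabilityResult("b", mean_score=0.85, std_score=0.05, pass_rate=1.0, n_fail=0)

for strategy in ("combined", "stability", "mean"):
    print(strategy, [r.name for r in rank([a, b], rank_by=strategy)])
```

With k=1.0 the combined scores are 0.72 for "a" and 0.8075 for "b", so the steadier config wins under combined and stability, while mean still prefers "a".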

Multi-Seed Guidance

Single-seed screening keeps computation affordable, but its ranking overfits to one random draw: a config that ranks 1st with seed=0 may rank 50th across 100 seeds. Recommended seed counts:

  • Tier 1: At least 10 seeds for initial screening

  • Tier 2: 20–30 seeds for medium confidence

  • Final selection: 100+ seeds for publication-quality results
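One way to see why the seed counts above matter: if per-seed scores are treated as independent draws (an assumption, not something the pipeline guarantees), the uncertainty of a config's mean score shrinks with the square root of the number of seeds. The helper name standard_error and the example standard deviation of 0.05 are illustrative.

```python
import math


def standard_error(std_score, n_seeds):
    """Standard error of the mean score, assuming independent seeds."""
    return std_score / math.sqrt(n_seeds)


# Uncertainty of the mean for a config whose per-seed std is 0.05.
for n_seeds in (1, 10, 20, 100):
    print(f"{n_seeds:>3} seeds: ±{standard_error(0.05, n_seeds):.4f}")
```

Going from 10 to 100 seeds cuts the uncertainty by a factor of about 3.2, which is why two configs whose means differ by less than the per-seed spread can only be separated reliably in the final tier.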

Multi-Pass Workflow

For thorough calibration:

  1. Run sensitivity → grid → stability for the primary parameters

  2. Fix the now-calibrated parameters at their optimal values

  3. Re-run the pipeline for the parameters that were held fixed during the first pass

  4. Compare results and iterate as needed
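The steps above amount to calibrating one parameter group at a time while holding the rest fixed. A minimal sketch follows, in which calibrate(free, fixed) is a hypothetical stand-in for one full sensitivity → grid → stability run, and multi_pass, toy_calibrate, and the stopping tolerance are likewise assumptions made for illustration.

```python
from itertools import product


def multi_pass(groups, start, calibrate, max_passes=3, tol=1e-6):
    """Calibrate parameter groups in turn until the score stops changing.

    calibrate(free, fixed) -> (best values for `free`, score) is hypothetical.
    """
    params = dict(start)
    prev_score = None
    score = None
    for _ in range(max_passes):
        for free in groups:
            # Hold every parameter outside the current group at its value.
            fixed = {k: v for k, v in params.items() if k not in free}
            best, score = calibrate(free, fixed)
            params.update(best)
        # Compare against the previous pass and stop once results agree.
        if prev_score is not None and abs(score - prev_score) <= tol:
            break
        prev_score = score
    return params, score


GRID = (0, 1, 2, 3)  # toy search grid for every parameter


def toy_calibrate(free, fixed):
    """Toy pipeline: exhaustive grid search on a known objective."""
    def objective(p):
        return -(p["a"] - 1) ** 2 - (p["b"] - 2) ** 2

    candidates = [dict(zip(free, vals)) for vals in product(GRID, repeat=len(free))]
    best = max(candidates, key=lambda c: objective({**fixed, **c}))
    return best, objective({**fixed, **best})


params, score = multi_pass([["a"], ["b"]], {"a": 0, "b": 0}, toy_calibrate)
print(params, score)  # prints: {'a': 1, 'b': 2} 0
```

In the toy objective the parameters do not interact, so the second pass only confirms the first. When parameters interact, later passes can shift earlier optima, which is the reason for step 4.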