Phase 3: Stability Testing & Ranking ====================================== The final phase progressively narrows candidates through increasing numbers of seeds using a tiered tournament. Tiered Tournament ----------------- :: Default tiers: [(100, 10), (50, 20), (10, 100)] Tier 1: top 100 configs × 10 seeds → rank → keep top 50 Tier 2: top 50 × +10 new seeds (20 total) → rank → keep top 10 Tier 3: top 10 × +80 new seeds (100 total) → final ranking Each tier only runs **new** seeds (incremental), accumulating all prior results. Total: 100×10 + 50×10 + 10×80 = 2,300 vs naive 100×100 = 10,000. .. code-block:: python from calibration import run_tiered_stability stable = run_tiered_stability( candidates=results[:20], scenario="baseline", n_workers=10, ) for r in stable[:5]: print(f"Score: {r.mean_score:.4f} ± {r.std_score:.4f}") Ranking Strategies ------------------ .. list-table:: :header-rows: 1 :widths: 15 45 40 * - Strategy - Formula - Best For * - ``combined`` - ``mean_score × (1 - k × std_score)`` - Balance of quality and stability * - ``stability`` - ``pass_rate DESC, n_fail ASC, combined DESC`` - Maximizing reproducibility * - ``mean`` - ``mean_score DESC`` - Ignoring variance, best average The ``--k-factor`` parameter (default 1.0) controls variance penalty in ``combined`` mode. Higher k penalizes high-variance configs more heavily. .. code-block:: bash python -m calibration --phase stability --rank-by stability --k-factor 1.5 Multi-Seed Guidance ------------------- Single-seed screening is necessary for computational efficiency but **overfits** to the specific random draw. A config ranking 1st with seed=0 may rank 50th across 100 seeds. Recommendations: - **Tier 1**: At least 10 seeds for initial screening - **Tier 2**: 20–30 seeds for medium confidence - **Final selection**: 100+ seeds for publication-quality results Multi-Pass Workflow ------------------- For thorough calibration: 1. Run sensitivity → grid → stability for the primary parameters 2. Fix the now-calibrated parameters at their optimal values 3. Re-run the pipeline for previously fixed parameters 4. Compare results and iterate as needed