Phase 3: Stability Testing & Ranking#
The final phase progressively narrows candidates through increasing numbers of seeds using a tiered tournament.
Tiered Tournament#
Default tiers: [(100, 10), (50, 20), (10, 100)]
Tier 1: top 100 configs × 10 seeds → rank → keep top 50
Tier 2: top 50 × +10 new seeds (20 total) → rank → keep top 10
Tier 3: top 10 × +80 new seeds (100 total) → final ranking
Each tier only runs new seeds (incremental), accumulating all prior results. Total: 100×10 + 50×10 + 10×80 = 2,300 vs naive 100×100 = 10,000.
from calibration import run_tiered_stability
stable = run_tiered_stability(
candidates=results[:20],
scenario="baseline",
n_workers=10,
)
for r in stable[:5]:
print(f"Score: {r.mean_score:.4f} ± {r.std_score:.4f}")
Ranking Strategies#
Strategy |
Formula |
Best For |
|---|---|---|
|
|
Balance of quality and stability |
|
|
Maximizing reproducibility |
|
|
Ignoring variance, best average |
The --k-factor parameter (default 1.0) controls variance penalty in
combined mode. Higher k penalizes high-variance configs more heavily.
python -m calibration --phase stability --rank-by stability --k-factor 1.5
Multi-Seed Guidance#
Single-seed screening is necessary for computational efficiency but overfits to the specific random draw. A config ranking 1st with seed=0 may rank 50th across 100 seeds. Recommendations:
Tier 1: At least 10 seeds for initial screening
Tier 2: 20–30 seeds for medium confidence
Final selection: 100+ seeds for publication-quality results
Multi-Pass Workflow#
For thorough calibration:
Run sensitivity → grid → stability for the primary parameters
Fix the now-calibrated parameters at their optimal values
Re-run the pipeline for previously fixed parameters
Compare results and iterate as needed