Scoring System
Validation uses a two-layer system: status checks determine pass/fail, while scores provide a continuous 0–1 measure for optimization.
Two-Layer Validation
Status checks (categorical):
PASS: Metric is within acceptable range
WARN: Metric is borderline (outside target but within tolerance)
FAIL: Metric significantly deviates from target
Scores (continuous, 0 to 1):
Each metric produces a score between 0.0 and 1.0. The total_score is a
weighted average across all metrics. A simulation passes if it has zero
FAIL-status metrics (n_fail == 0).
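The aggregation rule above can be sketched as follows. This is a minimal illustration rather than the module's implementation: the `MetricResult` container, its field names, the `summarize` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    # Hypothetical container; field names are assumptions for illustration.
    name: str
    score: float   # continuous score in [0.0, 1.0]
    weight: float  # metric weight used in the weighted average
    status: str    # "PASS", "WARN", or "FAIL"


def summarize(results):
    """Return (total_score, n_fail, passed) for a list of MetricResult."""
    total_weight = sum(r.weight for r in results)
    # Weighted average of the per-metric scores.
    total_score = sum(r.score * r.weight for r in results) / total_weight
    # Only FAIL statuses block a pass; WARN does not.
    n_fail = sum(1 for r in results if r.status == "FAIL")
    return total_score, n_fail, n_fail == 0


# Made-up sample data.
results = [
    MetricResult("metric_a", 0.9, 2.0, "PASS"),
    MetricResult("metric_b", 0.5, 1.0, "WARN"),
]
total_score, n_fail, passed = summarize(results)
```

Note that a WARN status lowers the score of its metric but does not prevent the simulation from passing.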
Weight-Based Fail Escalation
High-weight metrics have stricter WARN→FAIL thresholds. The escalation multiplier is computed from the metric weight \(w\) by fail_escalation_multiplier and scales the WARN→FAIL boundary in each check function.
| Weight | Multiplier | Effect (MEAN_TOLERANCE, normal FAIL at 2× tolerance) |
|---|---|---|
| 3.0 | 0.5 | FAIL at 1× tolerance (stricter) |
| 2.0 | 1.0 | FAIL at 2× tolerance (normal) |
| 1.5 | 2.0 | FAIL at 4× tolerance |
| 1.0 | 3.0 | FAIL at 6× tolerance |
| 0.5 | 4.0 | FAIL at 8× tolerance (lenient) |
BOOLEAN checks are exempt from escalation; they always have natural PASS/FAIL behavior.
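To make the table concrete, the sketch below looks up the documented multipliers and derives the MEAN_TOLERANCE FAIL boundary in units of the tolerance. It only replays the tabulated points with the default warn_multiplier of 2.0; the dictionary and helper names are hypothetical, and it does not reproduce the formula used by fail_escalation_multiplier.

```python
# Weight -> multiplier pairs copied from the table above (default constants).
DOCUMENTED_MULTIPLIERS = {3.0: 0.5, 2.0: 1.0, 1.5: 2.0, 1.0: 3.0, 0.5: 4.0}


def fail_boundary_in_tolerances(weight, warn_multiplier=2.0):
    """FAIL boundary for a MEAN_TOLERANCE check, in multiples of the tolerance.

    Hypothetical helper: only the five documented weights are supported.
    """
    return warn_multiplier * DOCUMENTED_MULTIPLIERS[weight]


for weight in sorted(DOCUMENTED_MULTIPLIERS, reverse=True):
    print(f"weight {weight}: FAIL beyond {fail_boundary_in_tolerances(weight)}x tolerance")
```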
Metric Types
| Type | How It Works |
|---|---|
| Range | Value must fall within [min, max] range |
| Mean tolerance | Value must be within percentage of target |
| Percentage within target | Percentage of time series within a band |
| Outlier penalty | Penalizes extreme values in distribution |
| Boolean | Binary pass/fail check (e.g., “economy did not collapse”) |
Improvement Scoring (Buffer-Stock)
The buffer-stock scenario uses additional scoring functions to measure improvement over the Growth+ baseline. For each Growth+ metric, the score delta \(\Delta\) is the buffer-stock score minus the Growth+ baseline score, so a positive \(\Delta\) means improvement.
The improvement check uses a weight-aware degradation threshold \(t\), derived from the base degradation threshold \(b\) (default 0.25) and the metric weight \(w\). High-weight metrics tolerate less degradation (stricter), while low-weight metrics are more lenient.
This check is applied at the aggregate level (mean delta across all seeds in a stability test), not per seed. Averaging across seeds damps per-seed noise, which allows a strict threshold without frequent false positives.
PASS: \(\Delta \geq 0\) (improved or unchanged)
WARN: \(\Delta < 0\) and \(|\Delta| \leq t\) (minor degradation within threshold)
FAIL: \(\Delta < 0\) and \(|\Delta| > t\) (significant degradation)
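The three-way rule above, applied to a mean delta across seeds, might look like the sketch below. The threshold is passed in already computed, so no assumption is made about how it is derived from the weight; the function name, the per-seed deltas, and the threshold value are hypothetical.

```python
from statistics import mean


def improvement_status(delta, threshold):
    """Classify an aggregate score delta against a degradation threshold."""
    if delta >= 0:
        return "PASS"   # improved or unchanged
    if abs(delta) <= threshold:
        return "WARN"   # minor degradation within threshold
    return "FAIL"       # significant degradation


# Made-up per-seed deltas; the check uses their mean, not each seed separately.
per_seed_deltas = [0.02, -0.05, -0.03]
aggregate_delta = mean(per_seed_deltas)
status = improvement_status(aggregate_delta, threshold=0.05)
```

Here one seed improves and two degrade; the mean is slightly negative and stays within the illustrative threshold, so the aggregate result is WARN rather than FAIL.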
API Reference
Scoring and status check functions for validation.
This module provides functions to compute scores and check statuses for validation metrics.
- validation.scoring.fail_escalation_multiplier(weight)
Compute the fail-escalation multiplier from a metric’s weight.
The multiplier scales the WARN→FAIL boundary in status check functions. A multiplier < 1 shrinks the WARN zone (stricter), > 1 widens it (more lenient).
- Mapping (with default constants):
weight 3.0 → 0.5 (FAIL at 0.5× normal threshold)
weight 2.0 → 1.0 (normal behaviour)
weight 1.5 → 2.0 (FAIL at 2× normal threshold)
weight 1.0 → 3.0 (FAIL at 3× normal threshold)
weight 0.5 → 4.0 (FAIL at 4× normal threshold)
- validation.scoring.score_mean_tolerance(actual, target, tolerance)
Score from 0-1 based on distance from target.
Returns 1.0 if exactly on target, 0.5 at distance == tolerance, and 0.0 at distance >= 2 * tolerance (linear decay throughout).
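A linear decay that hits the three stated points (1.0 on target, 0.5 at one tolerance, 0.0 at two tolerances) can be written as below. This is a sketch consistent with the description, not the library source; the `_sketch` suffix marks the function as hypothetical, and it assumes tolerance > 0.

```python
def score_mean_tolerance_sketch(actual, target, tolerance):
    """Linear score: 1.0 on target, 0.5 at one tolerance, 0.0 from two tolerances on."""
    distance = abs(actual - target)
    # Falls by 0.5 per tolerance of distance, clamped at zero.
    return max(0.0, 1.0 - distance / (2.0 * tolerance))


on_target = score_mean_tolerance_sketch(10.0, 10.0, 1.0)
at_tolerance = score_mean_tolerance_sketch(11.0, 10.0, 1.0)
far_away = score_mean_tolerance_sketch(15.0, 10.0, 1.0)
```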
- validation.scoring.score_range(actual, min_val, max_val)
Score from 0-1 based on position relative to range.
Returns 0.75-1.0 if inside range (higher near center). Returns 0.0-0.75 if outside range (decays with distance).
- validation.scoring.score_pct_within_target(actual_pct, target_pct, min_pct)
Score 0-1 for percentage meeting target.
Returns 1.0 if actual >= target, scores proportionally if >= min, and penalizes below min.
- validation.scoring.score_outlier_penalty(outlier_pct, max_outlier_pct, penalty_weight=2.0)
Score 0-1 with exponential penalty for excessive outliers.
Returns 1.0 if outlier_pct <= max_outlier_pct, else exponentially decays based on how much the actual exceeds the maximum allowed.
- validation.scoring.check_mean_tolerance(actual, target, tolerance, warn_multiplier=2.0, escalation=1.0)
Check if actual value is within tolerance of target.
- Returns:
PASS if within tolerance
WARN if within warn_multiplier * escalation * tolerance
FAIL otherwise
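The return rule can be sketched as follows, using plain strings in place of the Status type. Treating both boundaries as inclusive is an assumption, and the function name is hypothetical.

```python
def check_mean_tolerance_sketch(actual, target, tolerance,
                                warn_multiplier=2.0, escalation=1.0):
    """Return "PASS", "WARN", or "FAIL" from the distance to the target."""
    distance = abs(actual - target)
    if distance <= tolerance:
        return "PASS"
    # The WARN->FAIL boundary is the only part scaled by escalation.
    if distance <= warn_multiplier * escalation * tolerance:
        return "WARN"
    return "FAIL"


normal = check_mean_tolerance_sketch(11.5, 10.0, 1.0)                   # boundary at 2x tolerance
strict = check_mean_tolerance_sketch(11.5, 10.0, 1.0, escalation=0.5)   # boundary at 1x tolerance
```

With escalation=0.5 the WARN→FAIL boundary coincides with the PASS boundary, so the WARN zone disappears and the same value moves from WARN to FAIL.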
- validation.scoring.check_range(actual, min_val, max_val, warn_buffer=0.5, escalation=1.0)
Check if actual value is within range.
- Returns:
PASS if within [min_val, max_val]
WARN if within extended range (buffer * escalation applied)
FAIL otherwise
- validation.scoring.check_pct_within_target(actual_pct, target_pct, min_pct, escalation=1.0)
Check if percentage within target meets threshold.
With escalation, the WARN zone extends below min_pct proportionally to the original WARN-zone width (target_pct - min_pct).
- Returns:
PASS if actual >= target_pct
WARN if actual >= effective_min
FAIL otherwise
- validation.scoring.check_outlier_penalty(outlier_pct, max_outlier_pct, severe_multiplier=2.0, escalation=1.0)
Check if outlier percentage is within acceptable limits.
- Returns:
PASS if outlier_pct <= max_outlier_pct
WARN if outlier_pct <= max_outlier_pct * severe_multiplier * escalation
FAIL otherwise
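A direct transcription of the return rule is sketched below, with strings standing in for the Status type. The function name and sample percentages are hypothetical; the rule itself is unit-agnostic, so the inputs may be fractions or percentages as long as both use the same scale.

```python
def check_outlier_penalty_sketch(outlier_pct, max_outlier_pct,
                                 severe_multiplier=2.0, escalation=1.0):
    """Return "PASS", "WARN", or "FAIL" for an outlier percentage."""
    if outlier_pct <= max_outlier_pct:
        return "PASS"
    # The severe (FAIL) limit scales with both multipliers.
    if outlier_pct <= max_outlier_pct * severe_multiplier * escalation:
        return "WARN"
    return "FAIL"


default_case = check_outlier_penalty_sketch(12.0, 5.0)                  # severe limit 10.0
lenient_case = check_outlier_penalty_sketch(12.0, 5.0, escalation=2.0)  # severe limit 20.0
```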
- validation.scoring.check_improvement(delta, weight, max_degradation_base=0.1)
Check if a metric’s score delta indicates acceptable change.
Uses a weight-aware degradation threshold: high-weight metrics tolerate less degradation than low-weight ones.
- Parameters:
- Returns:
PASS if delta >= 0 (improved or same).
WARN if degradation within threshold.
FAIL if degradation exceeds threshold.
- Return type:
Status
- validation.scoring.score_improvement(delta)
Score from 0-1 based on improvement delta.
Returns max(0, min(1, 1 + delta)). Any non-negative delta yields 1.0. Degradation (delta < 0) lowers the score linearly, reaching 0.0 at delta <= -1.
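The clamp formula is short enough to check by hand. The sketch below evaluates it at a few made-up deltas; the function name is hypothetical.

```python
def score_improvement_sketch(delta):
    """Clamp 1 + delta into [0, 1], as in the documented formula."""
    return max(0.0, min(1.0, 1.0 + delta))


improved = score_improvement_sketch(0.2)    # capped at the upper bound
degraded = score_improvement_sketch(-0.25)  # loses a quarter of the score
collapsed = score_improvement_sketch(-1.5)  # floored at the lower bound
```

Because of the upper clamp, the score does not distinguish a small improvement from a large one; only degradation changes it.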
- validation.scoring.compute_combined_score(stability)
Compute combined score balancing accuracy and stability.
Formula: mean_score * pass_rate * (1 - std_score)
- Higher mean_score is better
- Higher pass_rate is better
- Lower std_score is better
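As a worked instance of the formula, the sketch below uses a hypothetical `StabilitySummary` dataclass as a stand-in for StabilityResult. The attribute names are taken from the formula's terms and are an assumption about the real class; the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class StabilitySummary:
    # Hypothetical stand-in for StabilityResult; attribute names are assumed.
    mean_score: float  # mean total score across seeds
    std_score: float   # standard deviation of the score across seeds
    pass_rate: float   # fraction of seeds that passed


def combined_score_sketch(s):
    """mean_score * pass_rate * (1 - std_score); higher is better."""
    return s.mean_score * s.pass_rate * (1.0 - s.std_score)


stable = combined_score_sketch(StabilitySummary(0.8, 0.05, 1.0))
unstable = combined_score_sketch(StabilitySummary(0.8, 0.25, 0.5))
```

Both runs share the same mean score, but the second is penalized twice: once for the lower pass rate and once for the larger spread across seeds.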
- Parameters:
stability (StabilityResult) – Result from run_stability_test().
- Returns:
Combined score (higher is better).
- Return type: