Scoring System

Validation uses a two-layer system: status checks determine pass/fail, while scores provide a continuous 0–1 measure for optimization.

Two-Layer Validation

Status checks (categorical):

  • PASS: Metric is within acceptable range

  • WARN: Metric is borderline (outside target but within tolerance)

  • FAIL: Metric significantly deviates from target

Scores (continuous, 0 to 1):

Each metric produces a score between 0.0 and 1.0. The total_score is a weighted average across all metrics. A simulation passes if it has zero FAIL-status metrics (n_fail == 0).
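The aggregation described above can be sketched as follows. This is only an illustration, not the module's code: the helper names `total_score` and `passes` and the use of plain strings for statuses are assumptions made here.

```python
def total_score(scores, weights):
    """Weighted average of per-metric scores, each expected in [0, 1]."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


def passes(statuses):
    """A run passes when no metric has FAIL status (n_fail == 0)."""
    n_fail = sum(1 for s in statuses if s == "FAIL")
    return n_fail == 0


# Hypothetical metrics: a high-weight metric scoring 1.0 and a
# low-weight metric scoring 0.5 with a WARN status.
example_total = total_score([1.0, 0.5], [3.0, 1.0])
example_pass = passes(["PASS", "WARN"])
```

Note that under this rule WARN statuses lower the score but do not by themselves cause a failure.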

Weight-Based Fail Escalation

High-weight metrics have stricter WARN→FAIL thresholds. The escalation multiplier is computed as:

\[m = \text{clamp}(5 - 2w, \; 0.5, \; 5.0)\]

where \(w\) is the metric weight. This multiplier scales the WARN→FAIL boundary in each check function.

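| Weight | Multiplier | Effect (MEAN_TOLERANCE, normal FAIL at 2× tolerance) |
|--------|------------|------------------------------------------------------|
| 3.0    | 0.5        | FAIL at 1× tolerance (stricter)                      |
| 2.0    | 1.0        | FAIL at 2× tolerance (normal)                        |
| 1.5    | 2.0        | FAIL at 4× tolerance                                 |
| 1.0    | 3.0        | FAIL at 6× tolerance                                 |
| 0.5    | 4.0        | FAIL at 8× tolerance (lenient)                       |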

BOOLEAN checks are exempt from escalation; they always have natural PASS/FAIL behavior.
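A minimal sketch of the multiplier formula, useful for checking the weight-to-multiplier mapping by hand. The names `_FAIL_FLOOR` and `_FAIL_CEILING` appear in the API reference; `_INTERCEPT` and `_SLOPE` are hypothetical names introduced here for the constants 5 and 2 in the formula.

```python
_FAIL_FLOOR = 0.5     # lower clamp from the formula
_FAIL_CEILING = 5.0   # upper clamp from the formula
_INTERCEPT = 5.0      # hypothetical name for the constant 5
_SLOPE = 2.0          # hypothetical name for the constant 2


def fail_escalation_multiplier(weight):
    """m = clamp(5 - 2w, 0.5, 5.0)."""
    raw = _INTERCEPT - _SLOPE * weight
    return max(_FAIL_FLOOR, min(_FAIL_CEILING, raw))


# For MEAN_TOLERANCE the normal FAIL boundary is 2x tolerance,
# so the escalated boundary is 2 * m times the tolerance.
for w in (3.0, 2.0, 1.5, 1.0, 0.5):
    m = fail_escalation_multiplier(w)
    print(f"weight={w}: multiplier={m}, FAIL beyond {2.0 * m}x tolerance")
```

Because of the clamp, any weight of 2.25 or more maps to the floor of 0.5, and a weight of 0 maps to the ceiling of 5.0.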

Metric Types

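| Type           | How It Works                                               |
|----------------|------------------------------------------------------------|
| RANGE          | Value must fall within the [min, max] range                |
| MEAN_TOLERANCE | Value must be within a percentage tolerance of the target  |
| PCT_WITHIN     | Percentage of the time series that falls within a band     |
| OUTLIER        | Penalizes extreme values in the distribution               |
| BOOLEAN        | Binary pass/fail check (e.g., “economy did not collapse”)  |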

Improvement Scoring (Buffer-Stock)

The buffer-stock scenario uses additional scoring functions to measure improvement over the Growth+ baseline. For each Growth+ metric, the buffer-stock score delta is computed:

\[\Delta = s_{\text{buffer-stock}} - s_{\text{growth+}}\]

The improvement check uses a weight-aware degradation threshold:

\[t = \frac{b}{w}\]

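where \(b\) is the base degradation threshold (the max_degradation_base parameter of check_improvement, default 0.1) and \(w\) is the metric weight. Because the threshold is inversely proportional to the weight, high-weight metrics tolerate less degradation (stricter), while low-weight metrics are more lenient.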

This check is applied at the aggregate level (mean delta across all seeds in a stability test), not per seed. Averaging across seeds reduces per-seed noise, which allows a strict threshold with fewer false positives.

  • PASS: \(\Delta \geq 0\) (improved or unchanged)

  • WARN: \(\Delta < 0\) and \(|\Delta| \leq t\) (minor degradation within threshold)

  • FAIL: \(\Delta < 0\) and \(|\Delta| > t\) (significant degradation)
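The three rules above can be sketched in a few lines. This is a simplified illustration: statuses are plain strings, the base threshold is passed explicitly rather than defaulted, and the per-seed deltas are made-up numbers.

```python
from statistics import mean


def check_improvement(delta, weight, max_degradation_base):
    """PASS / WARN / FAIL for an aggregate score delta, with threshold t = b / w."""
    if weight <= 0:
        raise ValueError("weight must be positive")
    if delta >= 0:
        return "PASS"
    threshold = max_degradation_base / weight
    return "WARN" if abs(delta) <= threshold else "FAIL"


# The check is applied to the mean delta across seeds, not to each seed.
per_seed_deltas = [-0.02, 0.01, -0.05, -0.04]  # illustrative values only
mean_delta = mean(per_seed_deltas)

# With b = 0.1 and w = 2.0 the threshold is 0.05.
status = check_improvement(mean_delta, weight=2.0, max_degradation_base=0.1)
```

With the same base, a metric of weight 0.5 would get a threshold of 0.2, which shows how low-weight metrics are treated more leniently.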

API Reference

Scoring and status check functions for validation.

This module provides functions to compute scores and check statuses for validation metrics.

validation.scoring.fail_escalation_multiplier(weight)

Compute the fail-escalation multiplier from a metric’s weight.

The multiplier scales the WARN→FAIL boundary in status check functions. A multiplier < 1 shrinks the WARN zone (stricter), > 1 widens it (more lenient).

Mapping (with default constants):

  • weight 3.0 → 0.5 (FAIL at 0.5× normal threshold)

  • weight 2.0 → 1.0 (normal behaviour)

  • weight 1.5 → 2.0 (FAIL at 2× normal threshold)

  • weight 1.0 → 3.0 (FAIL at 3× normal threshold)

  • weight 0.5 → 4.0 (FAIL at 4× normal threshold)

Parameters:

weight (float) – Metric weight (typically 0.5–3.0).

Returns:

Escalation multiplier, clamped to [_FAIL_FLOOR, _FAIL_CEILING].

Return type:

float

validation.scoring.score_mean_tolerance(actual, target, tolerance)

Score from 0-1 based on distance from target.

Returns 1.0 if exactly on target, 0.5 at distance == tolerance, and 0.0 at distance >= 2 * tolerance (linear decay throughout).
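One way to obtain the three anchor points described above (1.0 on target, 0.5 at one tolerance, 0.0 at two tolerances) is a single linear ramp. The sketch below assumes `tolerance` is an absolute distance in the same units as `actual` and `target`; the real function may interpret it differently (for example, as a fraction of the target).

```python
def score_mean_tolerance(actual, target, tolerance):
    """Linear decay: 1.0 on target, 0.5 at one tolerance, 0.0 at two or more."""
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    distance = abs(actual - target)
    # Slope of 1 / (2 * tolerance), floored at zero.
    return max(0.0, 1.0 - distance / (2.0 * tolerance))
```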

validation.scoring.score_range(actual, min_val, max_val)

Score from 0-1 based on position relative to range.

Returns 0.75-1.0 if inside range (higher near center). Returns 0.0-0.75 if outside range (decays with distance).

validation.scoring.score_pct_within_target(actual_pct, target_pct, min_pct)

Score 0-1 for percentage meeting target.

Returns 1.0 if actual >= target, scores proportionally if >= min, and penalizes below min.

validation.scoring.score_outlier_penalty(outlier_pct, max_outlier_pct, penalty_weight=2.0)

Score 0-1 with exponential penalty for excessive outliers.

Returns 1.0 if outlier_pct <= max_outlier_pct, else exponentially decays based on how much the actual exceeds the maximum allowed.

validation.scoring.check_mean_tolerance(actual, target, tolerance, warn_multiplier=2.0, escalation=1.0)

Check if actual value is within tolerance of target.

Returns:

  • PASS if within tolerance

  • WARN if within warn_multiplier * escalation * tolerance

  • FAIL otherwise
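The return rules above translate directly into a short function. As a sketch, it assumes `tolerance` is an absolute distance and uses plain strings in place of the module's status type.

```python
def check_mean_tolerance(actual, target, tolerance,
                         warn_multiplier=2.0, escalation=1.0):
    """PASS within tolerance, WARN within the escalated limit, FAIL beyond it."""
    distance = abs(actual - target)
    if distance <= tolerance:
        return "PASS"
    # The escalation multiplier scales only the WARN -> FAIL boundary.
    if distance <= warn_multiplier * escalation * tolerance:
        return "WARN"
    return "FAIL"
```

With `escalation=0.5` (a weight of 3.0) the WARN limit equals the tolerance itself, so any value outside the tolerance fails immediately.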

validation.scoring.check_range(actual, min_val, max_val, warn_buffer=0.5, escalation=1.0)

Check if actual value is within range.

Returns:

  • PASS if within [min_val, max_val]

  • WARN if within the extended range (warn_buffer * escalation applied)

  • FAIL otherwise

validation.scoring.check_pct_within_target(actual_pct, target_pct, min_pct, escalation=1.0)

Check if percentage within target meets threshold.

With escalation, the WARN zone extends below min_pct proportionally to the original WARN-zone width (target_pct - min_pct).

Returns:

  • PASS if actual >= target_pct

  • WARN if actual >= effective_min

  • FAIL otherwise
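The documentation does not spell out how effective_min is derived. One reading consistent with the description (the WARN zone scales in proportion to its original width, target_pct - min_pct) is sketched below; the formula for `effective_min` is an assumption, not taken from the source.

```python
def check_pct_within_target(actual_pct, target_pct, min_pct, escalation=1.0):
    """Sketch: scale the WARN zone width (target_pct - min_pct) by escalation."""
    # Assumed formula: escalation == 1.0 reproduces min_pct exactly,
    # values above 1.0 push the boundary below min_pct, values below
    # 1.0 pull it up toward target_pct.
    effective_min = target_pct - escalation * (target_pct - min_pct)
    if actual_pct >= target_pct:
        return "PASS"
    if actual_pct >= effective_min:
        return "WARN"
    return "FAIL"
```

For example, with a target of 90 and a minimum of 80, an escalation of 2.0 would move the FAIL boundary down to 70 under this reading.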

validation.scoring.check_outlier_penalty(outlier_pct, max_outlier_pct, severe_multiplier=2.0, escalation=1.0)

Check if outlier percentage is within acceptable limits.

Returns:

  • PASS if outlier_pct <= max_outlier_pct

  • WARN if outlier_pct <= max_outlier_pct * severe_multiplier * escalation

  • FAIL otherwise
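A direct sketch of these three rules, again using plain strings in place of the module's status type:

```python
def check_outlier_penalty(outlier_pct, max_outlier_pct,
                          severe_multiplier=2.0, escalation=1.0):
    """PASS up to the maximum, WARN up to the escalated severe limit, else FAIL."""
    if outlier_pct <= max_outlier_pct:
        return "PASS"
    severe_limit = max_outlier_pct * severe_multiplier * escalation
    if outlier_pct <= severe_limit:
        return "WARN"
    return "FAIL"
```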

validation.scoring.check_improvement(delta, weight, max_degradation_base=0.1)

Check if a metric’s score delta indicates acceptable change.

Uses a weight-aware degradation threshold: high-weight metrics tolerate less degradation than low-weight ones.

Parameters:
  • delta (float) – Score delta (buffer_stock_score - growth_plus_score). Positive means improvement, negative means degradation.

  • weight (float) – Metric weight (from the Growth+ metric spec).

  • max_degradation_base (float) – Base degradation threshold. Actual threshold = base / weight.

Returns:

  • PASS if delta >= 0 (improved or same)

  • WARN if degradation is within the threshold

  • FAIL if degradation exceeds the threshold

Return type:

Status

validation.scoring.score_improvement(delta)

Score from 0-1 based on improvement delta.

Returns max(0, min(1, 1 + delta)). Because of the upper clamp, any non-negative delta (improved or unchanged) yields exactly 1.0. Degradation (delta < 0) lowers the score linearly, reaching 0.0 at delta <= -1.

Parameters:

delta (float) – Score delta (buffer_stock_score - growth_plus_score).

Returns:

Score in [0.0, 1.0].

Return type:

float
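The clamping formula can be checked with a one-line sketch:

```python
def score_improvement(delta):
    """max(0, min(1, 1 + delta)): clamp 1 + delta into [0, 1]."""
    return max(0.0, min(1.0, 1.0 + delta))
```

Since both underlying scores lie in [0, 1], delta itself lies in [-1, 1], so the lower clamp only matters at the extreme delta = -1.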

validation.scoring.compute_combined_score(stability)

Compute combined score balancing accuracy and stability.

Formula: mean_score * pass_rate * (1 - std_score)

  • Higher mean_score is better

  • Higher pass_rate is better

  • Lower std_score is better

Parameters:

stability (StabilityResult) – Result from run_stability_test().

Returns:

Combined score (higher is better).

Return type:

float
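A sketch of the formula, using a stand-in for StabilityResult. The dataclass below is hypothetical: it assumes the result exposes attributes named mean_score, std_score, and pass_rate, which the formula suggests but the documentation does not confirm.

```python
from dataclasses import dataclass


@dataclass
class StabilityResult:
    """Hypothetical stand-in; the real class likely has more fields."""
    mean_score: float
    std_score: float
    pass_rate: float


def compute_combined_score(stability):
    """mean_score * pass_rate * (1 - std_score); higher is better."""
    return stability.mean_score * stability.pass_rate * (1.0 - stability.std_score)


# Two made-up results with the same mean score: the more stable one wins.
steady = StabilityResult(mean_score=0.8, std_score=0.05, pass_rate=1.0)
noisy = StabilityResult(mean_score=0.8, std_score=0.20, pass_rate=0.5)
```

The multiplicative form means a pass rate of zero forces the combined score to zero regardless of the mean score.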

validation.scoring.worst_status(*statuses)

Return the most severe status from the given statuses.
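One common way to implement this is to give the statuses an integer severity order and take the maximum. The `Status` enum below is a stand-in defined for this sketch, and returning PASS for an empty argument list is an assumption; the real function may behave differently in that case.

```python
from enum import IntEnum


class Status(IntEnum):
    """Stand-in status type, ordered by severity."""
    PASS = 0
    WARN = 1
    FAIL = 2


def worst_status(*statuses):
    """Return the most severe status; PASS when called with no arguments (assumed)."""
    return max(statuses, default=Status.PASS)
```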