Performance & Profiling#

This guide covers performance analysis, profiling, and optimization for developers and power users extending BAM Engine.

Note

For user-facing performance expectations and configuration guidance, see Performance & Scaling.

Optimization Philosophy#

Before optimizing, follow these principles:

Readability first: Clear code is maintainable code. Don’t sacrifice clarity for marginal performance gains.
Algorithm before implementation: A better algorithm beats micro-optimization. Review the literature before investing in code-level optimization.
Profile before optimizing: Measure actual bottlenecks, not assumptions. The critical path is often surprising.
Verify improvements: Always benchmark before and after changes to confirm the optimization works as expected.

Benchmark Results#

Current benchmarks (Apple M4 Pro, macOS 15.2, Python 3.13):

Configuration	Firms	Households	Banks	Time (1000 periods)	Throughput
Small	100	500	10	2.1s	475 periods/s
Medium	200	1,000	10	4.4s	225 periods/s
Large	500	2,500	10	13.6s	73 periods/s

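Performance scales roughly linearly with agent count, with somewhat worse-than-linear growth at larger scales: in the table above, the Large configuration has five times the firms and households of Small but takes about 6.5 times as long (13.6s vs. 2.1s). NumPy vectorization keeps the per-agent cost low but does not remove it, so doubling the number of agents roughly doubles simulation time, or slightly more.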
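The throughput and scaling figures can be re-derived from the table with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers reported above; the helper names (throughput, scaling_exponent) are illustrative and not part of BAM Engine, and the agent count is taken as firms plus households, which is an assumption made for this example.

```python
import math

PERIODS = 1000

# (firms, households, seconds for 1000 periods), copied from the table above
results = {
    "Small": (100, 500, 2.1),
    "Medium": (200, 1_000, 4.4),
    "Large": (500, 2_500, 13.6),
}


def throughput(seconds: float) -> float:
    """Periods simulated per second of wall-clock time."""
    return PERIODS / seconds


def scaling_exponent(a: str, b: str) -> float:
    """Exponent p in time ~ agents**p between two configurations."""
    agents_a = results[a][0] + results[a][1]
    agents_b = results[b][0] + results[b][1]
    time_a, time_b = results[a][2], results[b][2]
    return math.log(time_b / time_a) / math.log(agents_b / agents_a)


for name, (_firms, _households, seconds) in results.items():
    print(f"{name}: {throughput(seconds):.0f} periods/s")

print(f"Small -> Medium exponent: {scaling_exponent('Small', 'Medium'):.2f}")
print(f"Medium -> Large exponent: {scaling_exponent('Medium', 'Large'):.2f}")
```

An exponent of exactly 1 would mean perfectly linear scaling. Both exponents computed here come out somewhat above 1, and the recomputed throughput can differ from the table by a few periods/s because the reported times are rounded.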

For historical benchmark tracking across commits, see the ASV Benchmark Dashboard.

Profiling#

cProfile#

Generate a function-level timing breakdown using the built-in profiling script:

python benchmarks/profile_simulation.py

This runs a 1000-period simulation and outputs:

Top 30 functions by cumulative time (including subcalls)
Top 30 functions by self time (time in function itself)
Binary profile saved to benchmarks/simulation_profile.prof
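For ad hoc profiling outside the script, the same kind of report can be produced with the standard library directly. This is a minimal sketch, not the contents of benchmarks/profile_simulation.py: workload is a stand-in for a call such as sim.run(1000), and profile_to_text is a hypothetical helper.

```python
import cProfile
import io
import pstats


def workload(n: int = 50_000) -> int:
    """Stand-in for the simulation call; replace with the real workload."""
    return sum(i * i for i in range(n))


def profile_to_text(func, top: int = 30, sort_key: str = "cumulative") -> str:
    """Profile func() and return the top entries as a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    # To keep a binary profile for snakeviz, one could also call
    # profiler.dump_stats("some_file.prof") here.

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats(sort_key).print_stats(top)
    return stream.getvalue()


# "cumulative" includes time spent in subcalls; "tottime" is self time only.
report_cumulative = profile_to_text(workload, sort_key="cumulative")
report_self = profile_to_text(workload, sort_key="tottime")
print(report_cumulative)
```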

Visualize the profile interactively with snakeviz:

pip install snakeviz
snakeviz benchmarks/simulation_profile.prof

IPython %prun#

For quick profiling in interactive sessions:

import bamengine as bam
sim = bam.Simulation.init(seed=42)

# Overall profile sorted by cumulative time
%prun -s cumulative sim.run(100)

# Profile a single step
%prun -s tottime sim.step()

Line-level Profiling#

For detailed line-by-line analysis, use line_profiler:

pip install line_profiler

In IPython/Jupyter:

%load_ext line_profiler

# Profile the step method
%lprun -f sim.step sim.run(10)

# Profile a specific internal function
from bamengine.events._internal.goods_market import consumers_decide_firms_to_visit
%lprun -f consumers_decide_firms_to_visit sim.run(10)

Memory Profiling#

Track memory usage with memory_profiler:

pip install memory_profiler

In IPython/Jupyter:

%load_ext memory_profiler

# Peak memory for a run
%memit sim.run(100)

# Line-by-line memory usage
%mprun -f sim.step sim.run(10)
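Outside IPython, or when memory_profiler is not installed, the standard library's tracemalloc module gives a rough peak figure. This is a swapped-in technique rather than what the magics above do: tracemalloc tracks allocations made through Python's allocators, while %memit reports process memory, so the two numbers are not directly comparable. The workload function is a stand-in for a call such as sim.run(100).

```python
import tracemalloc


def peak_memory_bytes(func) -> int:
    """Return the peak traced allocation, in bytes, while func() runs."""
    tracemalloc.start()
    try:
        func()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def workload() -> None:
    # Stand-in for the simulation: temporarily allocates 8 MiB.
    buffer = bytearray(8 * 1024 * 1024)
    buffer[0] = 1


peak = peak_memory_bytes(workload)
print(f"peak: {peak / 1024 / 1024:.1f} MiB")
```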

ASV Benchmarking#

BAM Engine uses ASV (Airspeed Velocity) for automated performance tracking with machine-specific baselines.

Running Benchmarks#

cd asv_benchmarks

# Benchmark current commit
asv run

# Compare two commits
asv continuous HEAD~1 HEAD

# Compare specific benchmark
asv continuous -b SimulationSuite HEAD~1 HEAD

# Generate and view HTML report
asv publish && asv preview

Benchmark Suites#

The ASV configuration includes seven benchmark suites:

SimulationSuite: Full simulation runs (100/1000 periods) across small/medium/large
PipelineSuite: Single step performance (all events)
MemorySuite: Peak memory during initialization and simulation
CriticalEventSuite: Individual event benchmarks for the critical path (goods/labor/credit markets)
InitSuite: Initialization costs across different scales (100-1000 firms)
LoanBookSuite: Sparse relationship operations (append, aggregate, purge)
ScalingSuite: Performance scaling analysis with agent count (50-400 firms)
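For readers new to ASV, a suite is an ordinary Python class whose method prefixes tell ASV what to measure. The class below is a hypothetical illustration of that convention and is not one of the seven suites listed above; the workload is a plain list update standing in for a simulation call.

```python
class ExampleScalingSuite:
    """Hypothetical ASV suite showing the naming conventions."""

    # ASV runs each benchmark method once per parameter value.
    params = [50, 100, 200, 400]
    param_names = ["n_firms"]

    def setup(self, n_firms):
        # Runs before each benchmark and is excluded from the measurement.
        self.prices = [1.0 + 0.01 * i for i in range(n_firms)]

    def time_price_update(self, n_firms):
        # Methods prefixed with time_ are timed.
        self.prices = [p * 1.1 for p in self.prices]

    def peakmem_price_copy(self, n_firms):
        # Methods prefixed with peakmem_ report peak process memory.
        return list(self.prices)
```

ASV discovers such classes in the benchmark directory named in its configuration file, so a class like this would be picked up by asv run without further registration.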

Quick Benchmarks (pytest-benchmark)#

For fast local benchmarking during development, use the pytest-benchmark tests:

# Run all quick benchmarks
pytest tests/performance/test_quick_benchmarks.py -v

# With detailed statistics
pytest tests/performance/test_quick_benchmarks.py -v --benchmark-verbose

# Save baseline for comparison
pytest tests/performance/test_quick_benchmarks.py --benchmark-save=baseline

# Compare against saved baseline
pytest tests/performance/test_quick_benchmarks.py --benchmark-compare

These benchmarks cover core operations (single step, initialization) and critical events (goods/labor market operations) with a small configuration for quick feedback.

Regression Testing#

Performance regression tests run automatically as part of the pytest suite. Coverage instrumentation is disabled for performance tests (see tests/performance/conftest.py) because its sys.settrace overhead distorts measurements and can inflate timings by 40-60%.

# Run regression tests (included in full pytest runs)
pytest -m regression

# Run only performance tests
pytest tests/performance/ -v

These tests fail if performance degrades by more than 15% relative to established baselines. Baselines are machine-specific and must be updated manually after confirmed improvements.
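The 15% rule amounts to a simple comparison against a stored baseline. The sketch below illustrates the idea under stated assumptions; is_regression and best_of are hypothetical helpers, and the actual tests in tests/performance/ may measure and compare differently.

```python
import time

TOLERANCE = 0.15  # allow up to a 15% slowdown, matching the threshold above


def is_regression(measured: float, baseline: float,
                  tolerance: float = TOLERANCE) -> bool:
    """True if measured time exceeds the baseline by more than tolerance."""
    return measured > baseline * (1.0 + tolerance)


def best_of(func, repeats: int = 5) -> float:
    """Minimum wall-clock time over several runs, to reduce noise."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Example: a baseline of 2.0s tolerates anything up to 2.3s.
print(is_regression(2.2, baseline=2.0))  # within tolerance
print(is_regression(2.4, baseline=2.0))  # more than 15% slower
```

Taking the minimum over several repeats is one common way to suppress scheduling noise; it is a design choice of this sketch, not a documented property of the test suite.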

ASV benchmarks complement these tests with cross-commit tracking and machine-specific baselines, but they must be invoked separately and do not integrate with pytest.

Seed Stability Benchmarking#

The benchmarks/bench_seed_stability.py script runs a large-scale seed stability analysis: 3 scenarios × 1000 seeds parallelized across 10 workers, producing JSON result files for the bamengine.org stability dashboard.

# Current working tree (all scenarios)
PYTHONPATH=src python benchmarks/bench_seed_stability.py

# Single scenario
PYTHONPATH=src python benchmarks/bench_seed_stability.py --scenario baseline

# Customize seeds/workers
PYTHONPATH=src python benchmarks/bench_seed_stability.py --seeds 500 --workers 8

# Historical commits via git worktrees
python benchmarks/bench_seed_stability.py --tags v0.5.0..v0.6.2
python benchmarks/bench_seed_stability.py --commits HEAD~5..HEAD

# Preview without running
PYTHONPATH=src python benchmarks/bench_seed_stability.py --dry-run

Results are saved as JSON in benchmarks/results/ and committed to the repository for the validation-status CI workflow. Each file includes per-seed pass/fail data, per-metric statistics, and git metadata. Results are also copied to bamengine.org/data/stability/ for the stability dashboard.

Architecture Performance Notes#

Key optimizations in BAM Engine’s architecture:

Vectorization#

All agent-level operations are vectorized with NumPy. Avoid Python loops over agents:

from bamengine import ops

# Good: vectorized (fast)
ops.assign(role.price, ops.multiply(role.price, 1.1))

# Bad: Python loop (slow)
for i in range(len(role.price)):
    role.price[i] *= 1.1
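The difference is easy to measure on a stand-alone array. The following sketch uses plain NumPy rather than bamengine.ops, whose internals are not shown here; the array size and function names are arbitrary choices for illustration.

```python
import timeit

import numpy as np

rng = np.random.default_rng(42)
prices = rng.uniform(1.0, 2.0, size=100_000)


def update_loop(values: np.ndarray) -> np.ndarray:
    """Element-by-element update in a Python loop."""
    out = values.copy()
    for i in range(len(out)):
        out[i] *= 1.1
    return out


def update_vectorized(values: np.ndarray) -> np.ndarray:
    """Same update as a single NumPy operation."""
    return values * 1.1


# Both versions must agree before their timings are worth comparing.
assert np.allclose(update_loop(prices), update_vectorized(prices))

loop_time = timeit.timeit(lambda: update_loop(prices), number=3)
vec_time = timeit.timeit(lambda: update_vectorized(prices), number=3)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```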

Pre-allocation#

Fixed-size arrays are allocated at initialization. Avoid dynamic allocation during simulation:

# Good: use pre-allocated scratch buffer
ops.assign(role.scratch_buffer, computed_values)

# Bad: allocate new array each step
role.scratch_buffer = np.zeros(n_agents)
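In plain NumPy, the same idea is usually expressed with the out= argument of ufuncs, which writes the result into an existing array. This is a minimal sketch with made-up arrays; it does not show how ops.assign is implemented.

```python
import numpy as np

n_agents = 1_000
rng = np.random.default_rng(0)

price = rng.uniform(1.0, 2.0, size=n_agents)
quantity = rng.uniform(0.0, 10.0, size=n_agents)

# Allocated once, before the simulation loop starts.
scratch = np.empty(n_agents)
buffer_before = scratch

for _ in range(3):
    # out= writes into the existing buffer instead of creating
    # a new array on every step.
    np.multiply(price, quantity, out=scratch)

# The name still refers to the same array object: nothing was reallocated.
print(scratch is buffer_before)
```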

Sparse Relationships#

The LoanBook uses COO (Coordinate List) sparse format for memory efficiency:

Storage: O(active_loans) instead of O(n_firms × n_banks)
Efficient append: amortized O(1) with capacity doubling
Vectorized aggregation: np.bincount() and np.add.at()
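The three properties above can be sketched together in a small stand-alone class. TinyLoanBook is hypothetical and is not the real LoanBook API; it only illustrates parallel coordinate arrays, capacity doubling on append, and aggregation with np.bincount() and np.add.at().

```python
import numpy as np


class TinyLoanBook:
    """Hypothetical COO-style store: one entry per active loan."""

    def __init__(self, capacity: int = 4) -> None:
        self.borrower = np.empty(capacity, dtype=np.int64)
        self.lender = np.empty(capacity, dtype=np.int64)
        self.amount = np.empty(capacity, dtype=np.float64)
        self.size = 0  # number of slots actually in use

    def append(self, borrower: int, lender: int, amount: float) -> None:
        if self.size == len(self.amount):
            self._grow()
        i = self.size
        self.borrower[i] = borrower
        self.lender[i] = lender
        self.amount[i] = amount
        self.size += 1

    def _grow(self) -> None:
        # Doubling the capacity makes append amortized O(1).
        new_capacity = max(1, 2 * len(self.amount))
        for name in ("borrower", "lender", "amount"):
            old = getattr(self, name)
            new = np.empty(new_capacity, dtype=old.dtype)
            new[: self.size] = old[: self.size]
            setattr(self, name, new)

    def debt_per_borrower(self, n_firms: int) -> np.ndarray:
        # Weighted bincount sums loan amounts per borrower index.
        return np.bincount(
            self.borrower[: self.size],
            weights=self.amount[: self.size],
            minlength=n_firms,
        )

    def credit_per_lender(self, n_banks: int) -> np.ndarray:
        # np.add.at accumulates correctly when indices repeat.
        totals = np.zeros(n_banks)
        np.add.at(totals, self.lender[: self.size], self.amount[: self.size])
        return totals


book = TinyLoanBook(capacity=2)
for firm, bank, amount in [(0, 1, 10.0), (2, 0, 5.0), (0, 0, 2.5)]:
    book.append(firm, bank, amount)
print(book.debt_per_borrower(n_firms=3))
print(book.credit_per_lender(n_banks=2))
```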

Efficient Sorting#

Use argpartition instead of argsort when only top-k elements are needed:

from bamengine.utils import select_top_k_indices_sorted

# Get top 10 indices efficiently
top_k = select_top_k_indices_sorted(values, k=10)
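To see why this is cheaper, here is one way such a helper could be written with np.argpartition. This is a sketch under assumptions: top_k_indices_sorted is a hypothetical function, and it is not claimed to match the implementation or exact semantics of select_top_k_indices_sorted (for example, whether "top" means largest, or how ties are ordered).

```python
import numpy as np


def top_k_indices_sorted(values: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest values, ordered from largest to smallest."""
    k = min(k, len(values))
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    # Partition so the k largest values occupy the last k slots (unordered).
    candidates = np.argpartition(values, -k)[-k:]
    # Sort only those k candidates instead of the whole array.
    order = np.argsort(values[candidates])[::-1]
    return candidates[order]


values = np.array([3.0, 9.0, 1.0, 7.0, 5.0])
print(top_k_indices_sorted(values, k=2))
```

The partition step is linear in the array length on average, so the full O(n log n) sort is replaced by O(n + k log k) work.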

Critical Path#

The market matching system (labor, credit, goods markets) contains the primary bottlenecks. All three markets now rely on vectorized batch processing, with one exception: goods_market_round processes consumers sequentially.

Goods market: consumers_decide_firms_to_visit is fully vectorized using batch random sampling and 2D array operations. goods_market_round uses sequential processing: each consumer completes all Z visits before the next starts. The inner loop uses Python lists for performance (avoiding NumPy per-element overhead), adding ~1-4% to total simulation time at all scales.
Labor market: workers_decide_firms_to_apply uses vectorized firm selection. labor_market_round processes all applications simultaneously with batch conflict resolution via resolve_conflicts().
Credit market: credit_market_round uses grouped_cumsum for vectorized supply exhaustion, processing applications in ascending fragility order.
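The supply-exhaustion idea in the credit market can be illustrated with a cumulative sum that restarts for each bank. The sketch below is an assumption-laden illustration: this grouped_cumsum is a stand-alone implementation written for this example, its signature is not taken from BAM Engine, and the partial-grant rule in the usage is a simplification. It assumes applications are already sorted by bank and, within each bank, in the order they should be served.

```python
import numpy as np


def grouped_cumsum(values: np.ndarray, groups: np.ndarray) -> np.ndarray:
    """Cumulative sum that restarts at each new group.

    Assumes equal group ids are contiguous (groups is sorted).
    """
    values = np.asarray(values, dtype=float)
    if len(values) == 0:
        return values
    total = np.cumsum(values)
    # First position of each contiguous group.
    starts = np.flatnonzero(np.r_[True, groups[1:] != groups[:-1]])
    # Running total just before each group begins.
    offsets = np.r_[0.0, total[starts[1:] - 1]]
    # Group lengths, used to broadcast each offset to its members.
    lengths = np.diff(np.r_[starts, len(values)])
    return total - np.repeat(offsets, lengths)


# Five loan applications to two banks, already in serving order.
bank = np.array([0, 0, 0, 1, 1])
requested = np.array([4.0, 3.0, 5.0, 6.0, 2.0])
supply = np.array([8.0, 7.0])  # credit available per bank

cumulative = grouped_cumsum(requested, bank)
# Credit left at the applicant's bank before this application is served.
remaining = supply[bank] - (cumulative - requested)
granted = np.clip(remaining, 0.0, requested)
print(granted)
```

Each bank's grants sum to at most its supply, and no Python loop over applications is needed.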