Performance & Profiling
=======================
This guide covers performance analysis, profiling, and optimization for developers
and power users extending BAM Engine.
.. note::

   For user-facing performance expectations and configuration guidance, see
   :ref:`performance-and-scaling`.

Optimization Philosophy
-----------------------
Before optimizing, follow these principles:
1. **Readability first**: Clear code is maintainable code. Don't sacrifice clarity
for marginal performance gains.
2. **Algorithm before implementation**: A better algorithm beats micro-optimization.
Review the literature before investing in code-level optimization.
3. **Profile before optimizing**: Measure actual bottlenecks, not assumptions.
The critical path is often surprising.
4. **Verify improvements**: Always benchmark before and after changes to confirm
the optimization works as expected.
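
Principles 2 and 4 can be illustrated with a minimal sketch using the standard
library's ``timeit``. The two functions below are hypothetical stand-ins, not
BAM Engine code: one is a loop-based baseline, the other an algorithmic
replacement. The sketch checks that both agree before comparing timings.

```python
import timeit


def sum_squares_loop(n):
    """Baseline: accumulate in an explicit Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_formula(n):
    """Candidate: closed-form sum of squares for 0..n-1."""
    return (n - 1) * n * (2 * n - 1) // 6


n = 10_000

# Correctness first: an optimization that changes results is a bug.
assert sum_squares_loop(n) == sum_squares_formula(n)

# Take the minimum over several repeats to reduce noise from other processes.
baseline = min(timeit.repeat(lambda: sum_squares_loop(n), number=20, repeat=5))
candidate = min(timeit.repeat(lambda: sum_squares_formula(n), number=20, repeat=5))
speedup = baseline / candidate
print(f"baseline={baseline:.4f}s candidate={candidate:.6f}s speedup={speedup:.0f}x")
```

Reporting the minimum rather than the mean is a common choice for
micro-benchmarks, since slower repeats mostly reflect interference from the rest
of the system.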
Benchmark Results
-----------------
Current benchmarks (Apple M4 Pro, macOS 15.2, Python 3.13):
.. list-table::
   :header-rows: 1
   :widths: 15 10 15 10 15 15

   * - Configuration
     - Firms
     - Households
     - Banks
     - Time (1000 periods)
     - Throughput
   * - Small
     - 100
     - 500
     - 10
     - 2.1 s
     - 475 periods/s
   * - Medium
     - 200
     - 1,000
     - 10
     - 4.4 s
     - 225 periods/s
   * - Large
     - 500
     - 2,500
     - 10
     - 13.6 s
     - 73 periods/s

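Simulation time scales roughly linearly with agent count, and slightly faster
than linearly at the top end: doubling the agents from Small to Medium takes
about 2.1 times as long, while the 2.5-fold step from Medium to Large takes
about 3.1 times as long. NumPy vectorization is highly efficient, but the
per-agent computation cost remains, so doubling the number of agents at least
doubles simulation time.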
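
The scaling claim can be checked directly against the table. The sketch below
derives throughput, cost per agent-period, and an empirical scaling exponent
from the published timings. Counting firms, households, and banks together as
"agents" is an assumption made here for illustration.

```python
import math

# (firms, households, banks, seconds per 1000 periods), copied from the table
results = {
    "Small": (100, 500, 10, 2.1),
    "Medium": (200, 1_000, 10, 4.4),
    "Large": (500, 2_500, 10, 13.6),
}

rows = {}
for name, (firms, households, banks, seconds) in results.items():
    agents = firms + households + banks  # assumption: all agent types count equally
    rows[name] = (agents, seconds)
    throughput = 1000 / seconds  # periods per second
    us_per_agent_period = seconds / (1000 * agents) * 1e6
    print(
        f"{name:6s} agents={agents:5d} {throughput:6.1f} periods/s "
        f"{us_per_agent_period:.2f} us per agent-period"
    )

# If time ~ agents**alpha, then alpha = log(t2 / t1) / log(n2 / n1).
(n1, t1), (n2, t2) = rows["Small"], rows["Large"]
alpha = math.log(t2 / t1) / math.log(n2 / n1)
print(f"scaling exponent between Small and Large: {alpha:.2f}")
```

An exponent of exactly 1 would mean perfectly linear scaling. With these
figures it comes out slightly above 1.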
For historical benchmark tracking across commits, see the
`ASV Benchmark Dashboard `_.
Profiling
---------
cProfile
~~~~~~~~
Generate a function-level timing breakdown using the built-in profiling script:
.. code-block:: bash
python benchmarks/profile_simulation.py
This runs a 1000-period simulation and outputs:
* Top 30 functions by cumulative time (including subcalls)
* Top 30 functions by self time (time in function itself)
* Binary profile saved to ``benchmarks/simulation_profile.prof``
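
For readers who want to produce the same three outputs for their own code, the
following is a minimal sketch using only ``cProfile`` and ``pstats`` from the
standard library. It is not the contents of ``profile_simulation.py``; the
``run`` and ``step`` functions are hypothetical stand-ins for a simulation, and
the output path is a temporary file chosen for the example.

```python
import cProfile
import io
import os
import pstats
import tempfile


def step(state):
    """Stand-in for one simulation period (hypothetical)."""
    return [x * 1.01 for x in state]


def run(periods=1000):
    """Stand-in for a full simulation run (hypothetical)."""
    state = [1.0] * 100
    for _ in range(periods):
        state = step(state)
    return state


profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Save the binary profile so that tools such as snakeviz can open it later.
profile_path = os.path.join(tempfile.gettempdir(), "example_profile.prof")
profiler.dump_stats(profile_path)

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(30)  # time including subcalls
stats.sort_stats("tottime").print_stats(30)  # time in the function itself
report = buffer.getvalue()
print(report)
```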
Visualize the profile interactively with snakeviz:
.. code-block:: bash
pip install snakeviz
snakeviz benchmarks/simulation_profile.prof
IPython %prun
~~~~~~~~~~~~~
For quick profiling in interactive sessions:
.. code-block:: python
import bamengine as bam
sim = bam.Simulation.init(seed=42)
# Overall profile sorted by cumulative time
%prun -s cumulative sim.run(100)
# Profile a single step
%prun -s tottime sim.step()
Line-level Profiling
~~~~~~~~~~~~~~~~~~~~
For detailed line-by-line analysis, use ``line_profiler``:
.. code-block:: bash
pip install line_profiler
In IPython/Jupyter:
.. code-block:: python

   import bamengine as bam

   %load_ext line_profiler

   sim = bam.Simulation.init(seed=42)

   # Profile the step method
   %lprun -f sim.step sim.run(10)

   # Profile a specific internal function
   from bamengine.events._internal.goods_market import consumers_decide_firms_to_visit
   %lprun -f consumers_decide_firms_to_visit sim.run(10)

Memory Profiling
----------------
Track memory usage with ``memory_profiler``:
.. code-block:: bash
pip install memory_profiler
In IPython/Jupyter:
.. code-block:: python

   import bamengine as bam

   %load_ext memory_profiler

   sim = bam.Simulation.init(seed=42)

   # Peak memory for a run
   %memit sim.run(100)

   # Line-by-line memory usage
   %mprun -f sim.step sim.run(10)

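
Outside IPython, or when installing ``memory_profiler`` is not an option, the
standard library's ``tracemalloc`` gives a comparable first look. This is a
swapped-in alternative, not something the BAM Engine tooling is known to use,
and ``allocate`` is a hypothetical stand-in workload.

```python
import tracemalloc


def allocate(n_periods=100, n_agents=10_000):
    """Stand-in workload that keeps one list of floats per period."""
    history = []
    for _ in range(n_periods):
        history.append([0.0] * n_agents)
    return history


tracemalloc.start()
history = allocate()
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

print(f"current={current / 1e6:.1f} MB peak={peak / 1e6:.1f} MB")

# Show the three source lines responsible for the most traced memory.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

``tracemalloc`` reports memory traced through Python's allocation hooks, so its
figures will generally differ from the process-level numbers that ``%memit``
shows.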
ASV Benchmarking
----------------
BAM Engine uses `ASV (Airspeed Velocity) `_ for
automated performance tracking with machine-specific baselines.
Running Benchmarks
~~~~~~~~~~~~~~~~~~
.. code-block:: bash
cd asv_benchmarks
# Benchmark current commit
asv run
# Compare two commits
asv continuous HEAD~1 HEAD
# Compare specific benchmark
asv continuous -b SimulationSuite HEAD~1 HEAD
# Generate and view HTML report
asv publish && asv preview
Benchmark Suites
~~~~~~~~~~~~~~~~
The ASV configuration includes seven benchmark suites:
* **SimulationSuite**: Full simulation runs (100/1000 periods) across small/medium/large configurations
* **PipelineSuite**: Single step performance (all events)
* **MemorySuite**: Peak memory during initialization and simulation
* **CriticalEventSuite**: Individual event benchmarks for the critical path (goods/labor/credit markets)
* **InitSuite**: Initialization costs across different scales (100-1000 firms)
* **LoanBookSuite**: Sparse relationship operations (append, aggregate, purge)
* **ScalingSuite**: Performance scaling analysis with agent count (50-400 firms)
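
As a rough guide to how such suites are laid out, the sketch below follows
ASV's naming conventions: ``setup`` runs before each benchmark, methods
prefixed ``time_`` are timed, methods prefixed ``peakmem_`` record peak memory,
and ``params`` parameterizes the run. The class name and workload are
hypothetical placeholders, not one of the seven suites listed above.

```python
class ExampleStepSuite:
    """Hypothetical ASV-style suite; the workload is a placeholder."""

    # ASV runs every benchmark once per parameter value.
    params = [100, 200, 500]
    param_names = ["n_firms"]

    def setup(self, n_firms):
        # Runs before each benchmark and is excluded from the measurement.
        self.state = [1.0] * n_firms

    def time_step(self, n_firms):
        # Methods whose names start with ``time_`` are timed by ASV.
        self.state = [x * 1.01 for x in self.state]

    def peakmem_step(self, n_firms):
        # Methods whose names start with ``peakmem_`` record peak memory.
        self.state = [x * 1.01 for x in self.state]


# The class is plain Python, so it can be smoke-tested without ASV installed.
suite = ExampleStepSuite()
for n_firms in ExampleStepSuite.params:
    suite.setup(n_firms)
    suite.time_step(n_firms)
    print(n_firms, len(suite.state))
```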
Quick Benchmarks (pytest-benchmark)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For fast local benchmarking during development, use the pytest-benchmark tests:
.. code-block:: bash
# Run all quick benchmarks
pytest tests/performance/test_quick_benchmarks.py -v
# With detailed statistics
pytest tests/performance/test_quick_benchmarks.py -v --benchmark-verbose
# Save baseline for comparison
pytest tests/performance/test_quick_benchmarks.py --benchmark-save=baseline
# Compare against saved baseline
pytest tests/performance/test_quick_benchmarks.py --benchmark-compare
These benchmarks cover core operations (single step, initialization) and critical
events (goods/labor market operations) with a small configuration for quick feedback.
Regression Testing
~~~~~~~~~~~~~~~~~~
Performance regression tests run automatically as part of the ``pytest`` suite.
Coverage instrumentation is automatically disabled for performance tests (see
``tests/performance/conftest.py``) to avoid measurement distortion from
``sys.settrace`` overhead, which can inflate timings by 40-60%.
.. code-block:: bash
# Run regression tests (included in full pytest runs)
pytest -m regression
# Run only performance tests
pytest tests/performance/ -v
These tests fail if performance degrades by more than 15% relative to the
established baselines. Baselines are machine-specific and must be updated
manually after confirmed improvements.
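
The 15% rule amounts to a simple relative comparison. The helper below is a
hypothetical sketch of that arithmetic, not the code in ``tests/performance/``;
the function name and the example timings are assumptions.

```python
def check_regression(measured_s, baseline_s, tolerance=0.15):
    """Return (ok, slowdown), where slowdown is the fractional change vs. baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    slowdown = measured_s / baseline_s - 1.0
    return slowdown <= tolerance, slowdown


# A run that is slower than its baseline, but within tolerance.
ok, slowdown = check_regression(measured_s=2.30, baseline_s=2.10)
print(ok, f"{slowdown:+.1%}")

# A run that exceeds the tolerance and would be flagged.
ok_slow, slowdown_slow = check_regression(measured_s=2.50, baseline_s=2.10)
print(ok_slow, f"{slowdown_slow:+.1%}")
```

Because the comparison is relative to a stored baseline, it is only meaningful
on the machine that produced that baseline.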
ASV benchmarks complement these tests with cross-commit tracking and
machine-specific baselines, but require separate invocation and do not
integrate into ``pytest``.
Seed Stability Benchmarking
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``benchmarks/bench_seed_stability.py`` script runs large-scale seed
stability analysis: 3 scenarios × 1000 seeds parallelized across 10 workers,
producing JSON result files for the
`bamengine.org stability dashboard `_.
.. code-block:: bash
# Current working tree (all scenarios)
PYTHONPATH=src python benchmarks/bench_seed_stability.py
# Single scenario
PYTHONPATH=src python benchmarks/bench_seed_stability.py --scenario baseline
# Customize seeds/workers
PYTHONPATH=src python benchmarks/bench_seed_stability.py --seeds 500 --workers 8
# Historical commits via git worktrees
python benchmarks/bench_seed_stability.py --tags v0.5.0..v0.6.2
python benchmarks/bench_seed_stability.py --commits HEAD~5..HEAD
# Preview without running
PYTHONPATH=src python benchmarks/bench_seed_stability.py --dry-run
Results are saved as JSON in ``benchmarks/results/`` and committed to the
repository for the ``validation-status`` CI workflow. Each file includes
per-seed pass/fail data, per-metric statistics, and git metadata. Results
are also copied to ``bamengine.org/data/stability/`` for the stability
dashboard.
Architecture Performance Notes
------------------------------
Key optimizations in BAM Engine's architecture:
Vectorization
~~~~~~~~~~~~~
All agent-level operations use NumPy vectorized operations. Avoid Python loops
over agents:
.. code-block:: python
from bamengine import ops
# Good: vectorized (fast)
ops.assign(role.price, ops.multiply(role.price, 1.1))
# Bad: Python loop (slow)
for i in range(len(role.price)):
role.price[i] *= 1.1
Pre-allocation
~~~~~~~~~~~~~~
Fixed-size arrays are allocated at initialization. Avoid dynamic allocation
during simulation:
.. code-block:: python

   import numpy as np
   from bamengine import ops

   # Good: write into the pre-allocated scratch buffer
   ops.assign(role.scratch_buffer, computed_values)

   # Bad: allocate a new array every step
   role.scratch_buffer = np.zeros(n_agents)

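
The same idea in plain NumPy, as a minimal sketch: most ufuncs accept an
``out=`` argument that writes the result into an existing array instead of
allocating a new one. The array names here are illustrative and do not
correspond to BAM Engine roles.

```python
import numpy as np

n_agents = 1_000
price = np.full(n_agents, 2.0)
quantity = np.full(n_agents, 3.0)

# Allocated once, before the simulation loop starts.
revenue = np.empty(n_agents)
buffer_id = id(revenue)

for _ in range(10):
    # Writes into the existing buffer; no new array is created per step.
    np.multiply(price, quantity, out=revenue)

# The name still refers to the very same array object.
assert id(revenue) == buffer_id
print(revenue[:3])
```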
Sparse Relationships
~~~~~~~~~~~~~~~~~~~~
The LoanBook uses COO (Coordinate List) sparse format for memory efficiency:
* Storage: O(active_loans) instead of O(n_firms × n_banks)
* Efficient append: amortized O(1) with capacity doubling
* Vectorized aggregation: ``np.bincount()`` and ``np.add.at()``
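
The three properties above can be sketched in a few lines of NumPy. This is a
hypothetical illustration of a COO-style store, not the actual ``LoanBook``
class; the class name, field names, and methods are assumptions.

```python
import numpy as np


class CooLoans:
    """Hypothetical COO store: one (borrower, lender, amount) triple per loan."""

    _fields = ("borrower", "lender", "amount")

    def __init__(self, capacity=4):
        self.borrower = np.empty(capacity, dtype=np.int64)
        self.lender = np.empty(capacity, dtype=np.int64)
        self.amount = np.empty(capacity, dtype=np.float64)
        self.size = 0  # number of active loans; storage is O(size)

    def append(self, borrower, lender, amount):
        if self.size == len(self.amount):
            # Full: double the capacity, which makes appends amortized O(1).
            for name in self._fields:
                old = getattr(self, name)
                new = np.empty(max(1, 2 * len(old)), dtype=old.dtype)
                new[: self.size] = old
                setattr(self, name, new)
        i = self.size
        self.borrower[i], self.lender[i], self.amount[i] = borrower, lender, amount
        self.size += 1

    def debt_per_borrower(self, n_firms):
        # bincount sums the weights of all loans that share a borrower index.
        n = self.size
        return np.bincount(self.borrower[:n], weights=self.amount[:n], minlength=n_firms)

    def purge(self, borrower):
        # Compact the active prefix, dropping every loan of one borrower.
        n = self.size
        keep = self.borrower[:n] != borrower
        k = int(keep.sum())
        for name in self._fields:
            arr = getattr(self, name)
            arr[:k] = arr[:n][keep]
        self.size = k


loans = CooLoans()
for b, l, a in [(0, 1, 10.0), (2, 0, 5.0), (0, 0, 2.5), (1, 1, 4.0), (2, 1, 1.0)]:
    loans.append(b, l, a)
print(loans.debt_per_borrower(n_firms=3))
```

``np.add.at`` offers the same unbuffered accumulation when the target array
already exists and should be updated in place.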
Efficient Sorting
~~~~~~~~~~~~~~~~~
Use ``argpartition`` instead of ``argsort`` when only top-k elements are needed:
.. code-block:: python
from bamengine.utils import select_top_k_indices_sorted
# Get top 10 indices efficiently
top_k = select_top_k_indices_sorted(values, k=10)
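
To make the difference concrete, here is a minimal sketch of the technique in
plain NumPy. The function below is a hypothetical illustration and may differ
from ``select_top_k_indices_sorted`` in ordering or tie handling.

```python
import numpy as np


def top_k_indices_sorted(values, k):
    """Indices of the k largest values, ordered from largest to smallest."""
    values = np.asarray(values)
    k = min(k, values.size)
    if k == 0:
        return np.empty(0, dtype=np.intp)
    split = values.size - k
    # argpartition puts the k largest entries in the last k slots, unordered: O(n).
    candidates = np.argpartition(values, split)[split:]
    # Sort only those k candidates: O(k log k) instead of O(n log n).
    return candidates[np.argsort(values[candidates])[::-1]]


values = np.array([0.3, 2.5, 1.1, 9.0, 4.2, 0.7])
print(top_k_indices_sorted(values, k=3))
```

The saving grows with the gap between ``k`` and the array length, since only
the selected candidates are ever fully sorted.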
Critical Path
~~~~~~~~~~~~~
The market matching system (labor, credit, and goods markets) contains the primary
bottlenecks. Most of it runs as vectorized batch processing; the exception is
``goods_market_round``, which processes consumers one at a time:

* **Goods market**: ``consumers_decide_firms_to_visit`` is fully vectorized using
  batch random sampling and 2D array operations. ``goods_market_round`` uses
  sequential processing: each consumer completes all Z visits before the next
  starts. The inner loop uses Python lists for performance (avoiding NumPy
  per-element overhead) and adds ~1-4% to total simulation time at all scales.
* **Labor market**: ``workers_decide_firms_to_apply`` uses vectorized firm selection.
  ``labor_market_round`` processes all applications simultaneously with batch
  conflict resolution via ``resolve_conflicts()``.
* **Credit market**: ``credit_market_round`` uses ``grouped_cumsum`` for vectorized
  supply exhaustion, processing applications in ascending fragility order.

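
The credit market's supply exhaustion can be pictured as a cumulative sum that
restarts for each bank. The sketch below shows one way to compute such a
grouped cumulative sum in NumPy. It is a hypothetical illustration, not the
``grouped_cumsum`` helper itself, and it assumes that requests are already
sorted by bank and that a request is either fully funded or rejected.

```python
import numpy as np


def grouped_cumsum_sketch(values, groups):
    """Cumulative sum of ``values`` that restarts at each new group.

    Assumes ``groups`` is sorted so that equal ids are contiguous.
    """
    values = np.asarray(values, dtype=np.float64)
    groups = np.asarray(groups)
    if values.size == 0:
        return values.copy()
    total = np.cumsum(values)
    # Positions where a new group starts.
    starts = np.flatnonzero(np.r_[True, groups[1:] != groups[:-1]])
    # Running total just before each group starts (0 for the first group).
    offset = np.r_[0.0, total[starts[1:] - 1]]
    # Repeat each group's offset over its members and subtract it.
    sizes = np.diff(np.r_[starts, values.size])
    return total - np.repeat(offset, sizes)


# Loan requests sorted by bank, then by ascending fragility within each bank.
bank = np.array([0, 0, 0, 1, 1])
request = np.array([4.0, 3.0, 5.0, 2.0, 6.0])
supply = np.array([8.0, 7.0])  # credit supply per bank

cumulative = grouped_cumsum_sketch(request, bank)
granted = cumulative <= supply[bank]  # assumption: no partial funding
print(cumulative, granted)
```

Processing requests in ascending fragility order means the least fragile
borrowers are served first, until each bank's supply runs out.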