Performance & Profiling
=======================

This guide covers performance analysis, profiling, and optimization for
developers and power users extending BAM Engine.

.. note::

   For user-facing performance expectations and configuration guidance, see
   :ref:`performance-and-scaling`.

Optimization Philosophy
-----------------------

Before optimizing, follow these principles:

1. **Readability first**: Clear code is maintainable code. Don't sacrifice
   clarity for marginal performance gains.
2. **Algorithm before implementation**: A better algorithm beats
   micro-optimization. Review the literature before investing in code-level
   optimization.
3. **Profile before optimizing**: Measure actual bottlenecks, not assumptions.
   The critical path is often surprising.
4. **Verify improvements**: Always benchmark before and after changes to
   confirm the optimization works as expected.

Benchmark Results
-----------------

Current benchmarks (Apple M4 Pro, macOS 15.2, Python 3.13):

.. list-table::
   :header-rows: 1
   :widths: 15 10 15 10 15 15

   * - Configuration
     - Firms
     - Households
     - Banks
     - 1000 periods
     - Throughput
   * - Small
     - 100
     - 500
     - 10
     - 2.1s
     - 475 periods/s
   * - Medium
     - 200
     - 1,000
     - 10
     - 4.4s
     - 225 periods/s
   * - Large
     - 500
     - 2,500
     - 10
     - 13.6s
     - 73 periods/s

Performance scales approximately linearly with agent count. While NumPy
vectorization is highly efficient, the per-agent computation cost means
doubling the number of agents roughly doubles simulation time.

For historical benchmark tracking across commits, see the ASV Benchmark
Dashboard.

Profiling
---------

cProfile
~~~~~~~~

Generate function-level timing breakdown using the built-in profiling script:

.. code-block:: bash

   python benchmarks/profile_simulation.py

This runs a 1000-period simulation and outputs:

* Top 30 functions by cumulative time (including subcalls)
* Top 30 functions by self time (time in function itself)
* Binary profile saved to ``benchmarks/simulation_profile.prof``

Visualize the profile interactively with snakeviz:

.. code-block:: bash

   pip install snakeviz
   snakeviz benchmarks/simulation_profile.prof

IPython %prun
~~~~~~~~~~~~~

For quick profiling in interactive sessions:

.. code-block:: python

   import bamengine as bam

   sim = bam.Simulation.init(seed=42)

   # Overall profile sorted by cumulative time
   %prun -s cumulative sim.run(100)

   # Profile a single step
   %prun -s tottime sim.step()

Line-level Profiling
~~~~~~~~~~~~~~~~~~~~

For detailed line-by-line analysis, use ``line_profiler``:

.. code-block:: bash

   pip install line_profiler

In IPython/Jupyter:

.. code-block:: python

   %load_ext line_profiler

   # Profile the step method
   %lprun -f sim.step sim.run(10)

   # Profile a specific internal function
   from bamengine.events._internal.goods_market import consumers_decide_firms_to_visit
   %lprun -f consumers_decide_firms_to_visit sim.run(10)

Memory Profiling
----------------

Track memory usage with ``memory_profiler``:

.. code-block:: bash

   pip install memory_profiler

In IPython/Jupyter:

.. code-block:: python

   %load_ext memory_profiler

   # Peak memory for a run
   %memit sim.run(100)

   # Line-by-line memory usage
   %mprun -f sim.step sim.run(10)

ASV Benchmarking
----------------

BAM Engine uses ASV (Airspeed Velocity) for automated performance tracking
with machine-specific baselines.

Running Benchmarks
~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   cd asv_benchmarks

   # Benchmark current commit
   asv run

   # Compare two commits
   asv continuous HEAD~1 HEAD

   # Compare specific benchmark
   asv continuous -b SimulationSuite HEAD~1 HEAD

   # Generate and view HTML report
   asv publish && asv preview

Benchmark Suites
~~~~~~~~~~~~~~~~

The ASV configuration includes seven benchmark suites:

* **SimulationSuite**: Full simulation runs (100/1000 periods) across
  small/medium/large
* **PipelineSuite**: Single step performance (all events)
* **MemorySuite**: Peak memory during initialization and simulation
* **CriticalEventSuite**: Individual event benchmarks for the critical path
  (goods/labor/credit markets)
* **InitSuite**: Initialization costs across different scales (100-1000 firms)
* **LoanBookSuite**: Sparse relationship operations (append, aggregate, purge)
* **ScalingSuite**: Performance scaling analysis with agent count (50-400 firms)

Quick Benchmarks (pytest-benchmark)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For fast local benchmarking during development, use the pytest-benchmark tests:

.. code-block:: bash

   # Run all quick benchmarks
   pytest tests/performance/test_quick_benchmarks.py -v

   # With detailed statistics
   pytest tests/performance/test_quick_benchmarks.py -v --benchmark-verbose

   # Save baseline for comparison
   pytest tests/performance/test_quick_benchmarks.py --benchmark-save=baseline

   # Compare against saved baseline
   pytest tests/performance/test_quick_benchmarks.py --benchmark-compare

These benchmarks cover core operations (single step, initialization) and
critical events (goods/labor market operations) with a small configuration for
quick feedback.

Regression Testing
~~~~~~~~~~~~~~~~~~

Performance regression tests run automatically as part of the ``pytest`` suite.
Coverage instrumentation is automatically disabled for performance tests (see
``tests/performance/conftest.py``) to avoid measurement distortion from
``sys.settrace`` overhead, which can inflate timings by 40-60%.

.. code-block:: bash

   # Run regression tests (included in full pytest runs)
   pytest -m regression

   # Run only performance tests
   pytest tests/performance/ -v

These tests ensure performance doesn't degrade beyond 15% of established
baselines. Baselines are machine-specific and must be updated manually after
confirmed improvements.

ASV benchmarks complement these tests with cross-commit tracking and
machine-specific baselines, but require separate invocation and do not
integrate into ``pytest``.

Seed Stability Benchmarking
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``benchmarks/bench_seed_stability.py`` script runs large-scale seed
stability analysis: 3 scenarios × 1000 seeds parallelized across 10 workers,
producing JSON result files for the bamengine.org stability dashboard.

.. code-block:: bash

   # Current working tree (all scenarios)
   PYTHONPATH=src python benchmarks/bench_seed_stability.py

   # Single scenario
   PYTHONPATH=src python benchmarks/bench_seed_stability.py --scenario baseline

   # Customize seeds/workers
   PYTHONPATH=src python benchmarks/bench_seed_stability.py --seeds 500 --workers 8

   # Historical commits via git worktrees
   python benchmarks/bench_seed_stability.py --tags v0.5.0..v0.6.2
   python benchmarks/bench_seed_stability.py --commits HEAD~5..HEAD

   # Preview without running
   PYTHONPATH=src python benchmarks/bench_seed_stability.py --dry-run

Results are saved as JSON in ``benchmarks/results/`` and committed to the
repository for the ``validation-status`` CI workflow. Each file includes
per-seed pass/fail data, per-metric statistics, and git metadata. Results are
also copied to ``bamengine.org/data/stability/`` for the stability dashboard.

Architecture Performance Notes
------------------------------

Key optimizations in BAM Engine's architecture:

Vectorization
~~~~~~~~~~~~~

All agent-level operations use NumPy vectorized operations. Avoid Python loops
over agents:

.. code-block:: python

   from bamengine import ops

   # Good: vectorized (fast)
   ops.assign(role.price, ops.multiply(role.price, 1.1))

   # Bad: Python loop (slow)
   for i in range(len(role.price)):
       role.price[i] *= 1.1

Pre-allocation
~~~~~~~~~~~~~~

Fixed-size arrays are allocated at initialization. Avoid dynamic allocation
during simulation:

.. code-block:: python

   # Good: use pre-allocated scratch buffer
   ops.assign(role.scratch_buffer, computed_values)

   # Bad: allocate new array each step
   role.scratch_buffer = np.zeros(n_agents)

Sparse Relationships
~~~~~~~~~~~~~~~~~~~~

The LoanBook uses COO (Coordinate List) sparse format for memory efficiency:

* Storage: O(active_loans) instead of O(n_firms × n_banks)
* Efficient append: amortized O(1) with capacity doubling
* Vectorized aggregation: ``np.bincount()`` and ``np.add.at()``

Efficient Sorting
~~~~~~~~~~~~~~~~~

Use ``argpartition`` instead of ``argsort`` when only top-k elements are
needed:

.. code-block:: python

   from bamengine.utils import select_top_k_indices_sorted

   # Get top 10 indices efficiently
   top_k = select_top_k_indices_sorted(values, k=10)

Critical Path
~~~~~~~~~~~~~

The market matching system (labor, credit, goods markets) contains the primary
bottlenecks. All three markets now use vectorized batch processing:

* **Goods market**: ``consumers_decide_firms_to_visit`` is fully vectorized
  using batch random sampling and 2D array operations. ``goods_market_round``
  uses sequential processing: each consumer completes all Z visits before the
  next starts. The inner loop uses Python lists for performance (avoiding NumPy
  per-element overhead), adding ~1-4% to total simulation time at all scales.
* **Labor market**: ``workers_decide_firms_to_apply`` uses vectorized firm
  selection. ``labor_market_round`` processes all applications simultaneously
  with batch conflict resolution via ``resolve_conflicts()``.
* **Credit market**: ``credit_market_round`` uses ``grouped_cumsum`` for
  vectorized supply exhaustion, processing applications in ascending fragility
  order.
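
The supply-exhaustion idea from the credit market item can be illustrated with
a grouped cumulative sum in plain NumPy. This is a minimal sketch and not BAM
Engine's ``grouped_cumsum``: the function name ``grouped_cumsum_sketch``, its
signature, and the sample data are assumptions made for illustration. It
assumes applications are already sorted by bank and, within each bank, by
ascending fragility, so each bank's applications form a contiguous run.

.. code-block:: python

   import numpy as np


   def grouped_cumsum_sketch(values, group_ids):
       """Cumulative sum of ``values`` that restarts at each new group.

       Hypothetical helper; assumes ``group_ids`` is sorted so that every
       group occupies one contiguous run.
       """
       values = np.asarray(values, dtype=float)
       group_ids = np.asarray(group_ids)
       if values.size == 0:
           return values.copy()
       total = np.cumsum(values)
       # Index of the first element of each contiguous group.
       starts = np.flatnonzero(np.r_[True, group_ids[1:] != group_ids[:-1]])
       # Running total just before each group starts (0 for the first group).
       offsets = np.r_[0.0, total[starts[1:] - 1]]
       # Number of elements in each group.
       counts = np.diff(np.r_[starts, values.size])
       return total - np.repeat(offsets, counts)


   # Illustrative data: five loan applications to two banks.
   bank = np.array([0, 0, 0, 1, 1])
   amount = np.array([4.0, 3.0, 5.0, 2.0, 6.0])
   supply = np.array([6.0, 10.0])  # credit available per bank

   # Credit already requested by earlier applicants at the same bank.
   requested_before = grouped_cumsum_sketch(amount, bank) - amount
   # Grant what is left of the bank's supply, capped at the requested amount.
   granted = np.clip(supply[bank] - requested_before, 0.0, amount)

With this data, bank 0 fully serves its first applicant, partially serves the
second, and has nothing left for the third, all without a Python loop over
applications.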