Skip to content

Add mantid_memory_fragmentation.py CIS test for isolated Mantid memory fragmentation investigation#5

Draft
Copilot wants to merge 5 commits intomantid_memory_fragmentationfrom
copilot/add-mantid-memory-fragmentation-test
Draft

Add mantid_memory_fragmentation.py CIS test for isolated Mantid memory fragmentation investigation#5
Copilot wants to merge 5 commits intomantid_memory_fragmentationfrom
copilot/add-mantid-memory-fragmentation-test

Conversation

Copy link
Copy Markdown

Copilot AI commented Apr 29, 2026

  • Create tests/cis_tests/mantid_memory_fragmentation.py matching reference structure
  • Address code review feedback (datetime.now(timezone.utc), explicit dual-cursor naming)
  • Fix OOM on first iteration: CreateSampleWorkspace was using BinWidth=1.0 over [0, 1000] us → 1000 bins × 1,179,648 spectra × 24 B (X+Y+E) ≈ 28 GB Workspace2D after ConvertToMatrixWorkspace. Set BinWidth=TOF_MAX_US (single-bin X-axis), matching the typical live-data chunk shape. Resulting histogram is now ~28 MB.
  • Pick events_per_pixel so that it is explicitly not a power of two (so the underlying std::vector<TofEvent> capacity is strictly larger than its size — the case we want to exercise for the fragmentation investigation) and smaller than the previous value: starting from the previous power-of-two-rounded value (128) and dividing by 3.0 yields events_per_pixel = 42. The event WS is now ~0.74 GB; expected RSS ≈ 5 GB (buffer) + 0.74 GB (event WS) ≈ ~5.74 GB.
  • Set the Mantid framework log level to warning via ConfigService.setString("logging.loggers.root.level", "warning") so the per-iteration RSS prints emitted by this script are easy to find in the otherwise verbose Mantid log stream.
  • Update module docstring to document the bin-count rationale, the events_per_pixel rationale, and the log-level setting
Original prompt

Goal

Add a new CIS test script at tests/cis_tests/mantid_memory_fragmentation.py on the mantid_memory_fragmentation branch. The script exercises the same Mantid code paths that the existing tests/cis_tests/live_data_memory_leak_v2.py exercises after it has obtained a live-data chunk, but without invoking any live-data listener / LoadLiveData / hasLiveDataConnection / readLiveMetadata machinery. It is meant to be runnable on any developer workstation (no bl3-daq1 connection required) so we can investigate Mantid memory fragmentation in isolation.

The reference script we are paralleling is at:
https://github.com/ekapadi/SNAPRed/blob/8478e3d1718b05f85819476f8367b75dae0148fe/tests/cis_tests/live_data_memory_leak_v2.py

Background — what the live-data script does that we want to mimic

In LoadLiveData::exec (Framework/LiveData/src/LoadLiveData.cpp, around lines 488–535) when PreserveEvents=False and the chunk is an EventWorkspace, Mantid creates a Workspace2D by invoking the ConvertToMatrixWorkspace child algorithm and replaces the output workspace with the resulting histogram workspace. We want our test to call ConvertToMatrixWorkspace so we exercise the same code path.

The readLiveMetadata() flow in the reference script then reads the Run of the resulting workspace to extract logs. We mimic that by transferring all log properties from the converted Workspace2D into a Python dict.

Required design

Imports / shared helpers

  • Reuse the structure of live_data_memory_leak_v2.py:
    • Same imports for numpy as np, os, time, logging, tracemalloc, pathlib, datetime, pydantic (only if useful), mantid.simpleapi *, mantid.kernel.ConfigService.
    • Copy the get_pid_rss_kb(pid) helper verbatim from the reference script.
    • Keep the snapred / IPTS_override / not_a_test / pause imports and the IPTS_override() with-block wrapper, so the script structurally matches the reference. The Config["liveData.*"] lookups, hasLiveDataConnection(), _readLiveData, _liveMetadataFromRun, readLiveMetadata functions must NOT appear (this script is explicitly non-live-data).
    • Set up logging the same way (logging.basicConfig(level=logging.DEBUG), logger name "mantid_memory_fragmentation").
    • Set the facility once: ConfigService.setFacility(Config["liveData.facility.name"]) — it is harmless and matches the reference.

Tunable constants (module-level, near top)

N_TESTS = 100                          # number of loop iterations
DT_SLEEP = 0.0                         # seconds between iterations (default: no throttling)
EVENT_WS_TARGET_BYTES = int(2.5 * 1024**3)   # ~2.5 GB per-iteration event workspace
BUFFER_TARGET_BYTES   = int(5.0 * 1024**3)   # ~5 GB circular buffer of float64 doubles
TOF_MAX_US = 1000.0                    # physical TOF range upper bound (microseconds)
RSS_PRINT_INTERVAL_S = 10.0            # throttle RSS printing

TofEvent is 16 bytes in Mantid (double TOF + 8-byte pulse-time). So:

EVENTS_PER_ITER = EVENT_WS_TARGET_BYTES // 16   # ~167 M events for 2.5 GB
BUFFER_LEN      = BUFFER_TARGET_BYTES // 8      # ~671 M float64 values for 5 GB

One-time setup (before the loop)

  1. Allocate the circular buffer of TOF-like doubles, uniformly distributed in [0.0, TOF_MAX_US):

    buffer = np.random.default_rng(seed=12345).uniform(0.0, TOF_MAX_US, size=BUFFER_LEN)

    Print its size in GB after allocation.

  2. Pick names for the two ADS workspaces used each iteration:

    ev_ws_name   = "frag_event_ws"
    hist_ws_name = "frag_hist_ws"
  3. Initialize the empty logs dict:

    logs: dict = {}
  4. Probe SNAP geometry to choose NumBanks/BankPixelWidth for CreateSampleWorkspace:

    • Load SNAP once via LoadEmptyInstrument(InstrumentName="SNAP", OutputWorkspace="__snap_probe") and read getNumberHistograms().
    • SNAP has 18 banks of 256x256 pixels = 1,179,648 spectra. Use NumBanks=18, BankPixelWidth=256 so CreateSampleWorkspace produces a matching histogram count, allowing a clean LoadInstrument(InstrumentName="SNAP", RewriteSpectraMap=True) overlay.
    • Compute events_per_pixel = max(1, EVENTS_PER_ITER // num_hist). Recompute the actual achieved event count and log it.
    • Delete __snap_probe after probing.
  5. Initialize the buffer cursor: offset = 0, and the RSS-print throttle: last_print = 0.0 (so iteration 0 always prints), and start_perf = time.perf_counter().

  6. Start tracemalloc: tracemalloc.start().

Per-iteration loop body (for n_test in range(N_TESTS):)

  1. Create the event workspace with SNAP geometry, filled with ~2.5 GB of events:
    CreateSampleWorkspace(
        OutputWorkspace=ev_ws_name,
        WorkspaceType="Event",
        NumBanks=18,
        BankPixelWidth=256,
        NumEvents=events_per_pixel,
        XMin=0.0,
      ...

This pull request was created from Copilot chat.

Copilot AI changed the title [WIP] Add new CIS test script for Mantid memory fragmentation Add mantid_memory_fragmentation.py CIS test for isolated Mantid memory fragmentation investigation Apr 29, 2026
Copilot AI requested a review from ekapadi April 29, 2026 22:21
@ekapadi
Copy link
Copy Markdown
Owner

ekapadi commented Apr 29, 2026

@copilot When running the new script. During the very first cycle we see an RSS of ~29.5 GB, which triggers an immediate OOM-kill. Please double-check your size calculations in the script?

Copy link
Copy Markdown
Author

Copilot AI commented Apr 29, 2026

@copilot When running the new script. During the very first cycle we see an RSS of ~29.5 GB, which triggers an immediate OOM-kill. Please double-check your size calculations in the ...

Found it — the buffer and event-workspace sizes were correct (~5 GB and ~2.5 GB), but CreateSampleWorkspace(..., XMin=0.0, XMax=1000.0, BinWidth=1.0) produced a 1000-bin X-axis. After ConvertToMatrixWorkspace, the resulting Workspace2D was 1000 bins × 1,179,648 spectra × 24 B (X+Y+E) ≈ 28 GB — that's the ~29.5 GB you saw on iter 0.

The live-data PreserveEvents=False path produces a small (typically single-bin) histogram per chunk, so I set BinWidth=TOF_MAX_US to give a single-bin X-axis. The histogram is now ~28 MB and peak RSS stays at the intended ~5 GB. Fix in a83ef86.

@ekapadi
Copy link
Copy Markdown
Owner

ekapadi commented Apr 29, 2026

@copilot It is working better, but the initial RSS is now about 12 GB (it should be more like 7.5 = 5 <event buffer> + 2.5 <event workspace>). Please change the circular buffer size to be 2.5 GB; AND then, see if you can find where the rest of the extra space is coming from? Also, let's turn Mantid's log-level to WARNING, so that Mantid doesn't spam the logs?

Copy link
Copy Markdown
Author

Copilot AI commented Apr 29, 2026

@copilot It is working better, but the initial RSS is now about 12 GB (it should be more like 7.5 = 5 <event buffer> + 2.5 <event workspace>). Please change the circular buffer s...

The ~4.5 GB overshoot is std::vector<TofEvent> capacity slack inside each EventList. CreateSampleWorkspace populates each spectrum via push_back, so the vector capacity grows in powers of two. With raw events_per_pixel = 167M / 1.18M ≈ 142, the per-spectrum capacity rounds up to 256 — across 1,179,648 spectra that's an extra ~2 GB of slack memory (256 × 1,179,648 × 16 B ≈ 4.5 GB resident vs the intended 2.5 GB). Add buffer (5 GB) + Python/Mantid/tracemalloc overhead and you land near 12 GB.

Fix in bb6be72: round events_per_pixel down to the nearest power of two (142 → 128). Now capacity == size, so the event WS is exactly 128 × 1,179,648 × 16 B = 2.25 GB, and total expected RSS ≈ 7.25 GB (5 GB buffer + 2.25 GB event WS). Docstring updated to document the rounding rationale.

@ekapadi
Copy link
Copy Markdown
Owner

ekapadi commented Apr 30, 2026

@copilot Let's start with the current value for events_per_pixel, and divide it by 3.0 -- I want it to definitely NOT be a power of two, but also to be smaller than it is currently. Also, please set the Mantid log-level to WARNING so we can actually find the RSS logs in the logs stream?

Copy link
Copy Markdown
Author

Copilot AI commented Apr 30, 2026

@copilot Let's start with the current value for events_per_pixel, and divide it by 3.0 -- I want it to definitely NOT be a power of two, but also to be smaller than it is currently....

Done in d6f69e8.

  • events_per_pixel now = previous pow2-rounded value / 3.0 → 128 / 3 = 42 (not a power of two, smaller than before). Event WS ≈ 0.74 GB; expected RSS ≈ 5 GB (buffer) + 0.74 GB (event WS) ≈ 5.74 GB.
  • Mantid framework log level set to warning via ConfigService.setString("logging.loggers.root.level", "warning") so the per-iteration RSS prints stand out in the log stream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants