Add mantid_memory_fragmentation.py CIS test for isolated Mantid memory fragmentation investigation by Copilot · Pull Request #5 · ekapadi/SNAPRed

Copilot · 2026-04-29T22:16:24Z

Create tests/cis_tests/mantid_memory_fragmentation.py matching reference structure
Address code review feedback (datetime.now(timezone.utc), explicit dual-cursor naming)
Fix OOM on first iteration: CreateSampleWorkspace was using BinWidth=1.0 over [0, 1000] us → 1000 bins × 1,179,648 spectra × 24 B (X+Y+E) ≈ 28 GB Workspace2D after ConvertToMatrixWorkspace. Set BinWidth=TOF_MAX_US (single-bin X-axis), matching the typical live-data chunk shape. Resulting histogram is now ~28 MB.
Pick events_per_pixel so that it is explicitly not a power of two (so the underlying std::vector<TofEvent> capacity is strictly larger than its size — the case we want to exercise for the fragmentation investigation) and smaller than the previous value: starting from the previous power-of-two-rounded value (128) and dividing by 3.0 yields events_per_pixel = 42. The event WS is now ~0.74 GB; expected RSS ≈ 5 GB (buffer) + 0.74 GB (event WS) ≈ ~5.74 GB.
Set the Mantid framework log level to warning via ConfigService.setString("logging.loggers.root.level", "warning") so the per-iteration RSS prints emitted by this script are easy to find in the otherwise verbose Mantid log stream.
Update module docstring to document the bin-count rationale, the events_per_pixel rationale, and the log-level setting

Original prompt

Goal

Add a new CIS test script at tests/cis_tests/mantid_memory_fragmentation.py on the mantid_memory_fragmentation branch. The script exercises the same Mantid code paths that the existing tests/cis_tests/live_data_memory_leak_v2.py exercises after it has obtained a live-data chunk, but without invoking any live-data listener / LoadLiveData / hasLiveDataConnection / readLiveMetadata machinery. It is meant to be runnable on any developer workstation (no bl3-daq1 connection required) so we can investigate Mantid memory fragmentation in isolation.

The reference script we are paralleling is at:
https://github.com/ekapadi/SNAPRed/blob/8478e3d1718b05f85819476f8367b75dae0148fe/tests/cis_tests/live_data_memory_leak_v2.py

Background — what the live-data script does that we want to mimic

In LoadLiveData::exec (Framework/LiveData/src/LoadLiveData.cpp, around lines 488–535) when PreserveEvents=False and the chunk is an EventWorkspace, Mantid creates a Workspace2D by invoking the ConvertToMatrixWorkspace child algorithm and replaces the output workspace with the resulting histogram workspace. We want our test to call ConvertToMatrixWorkspace so we exercise the same code path.

The readLiveMetadata() flow in the reference script then reads the Run of the resulting workspace to extract logs. We mimic that by transferring all log properties from the converted Workspace2D into a Python dict.

Required design

Imports / shared helpers

Reuse the structure of live_data_memory_leak_v2.py:
- Same imports for numpy as np, os, time, logging, tracemalloc, pathlib, datetime, pydantic (only if useful), mantid.simpleapi *, mantid.kernel.ConfigService.
- Copy the get_pid_rss_kb(pid) helper verbatim from the reference script.
- Keep the snapred / IPTS_override / not_a_test / pause imports and the IPTS_override() with-block wrapper, so the script structurally matches the reference. The Config["liveData.*"] lookups, hasLiveDataConnection(), _readLiveData, _liveMetadataFromRun, readLiveMetadata functions must NOT appear (this script is explicitly non-live-data).
- Set up logging the same way (logging.basicConfig(level=logging.DEBUG), logger name "mantid_memory_fragmentation").
- Set the facility once: ConfigService.setFacility(Config["liveData.facility.name"]) — it is harmless and matches the reference.

Tunable constants (module-level, near top)

N_TESTS = 100                          # number of loop iterations
DT_SLEEP = 0.0                         # seconds between iterations (default: no throttling)
EVENT_WS_TARGET_BYTES = int(2.5 * 1024**3)   # ~2.5 GB per-iteration event workspace
BUFFER_TARGET_BYTES   = int(5.0 * 1024**3)   # ~5 GB circular buffer of float64 doubles
TOF_MAX_US = 1000.0                    # physical TOF range upper bound (microseconds)
RSS_PRINT_INTERVAL_S = 10.0            # throttle RSS printing

TofEvent is 16 bytes in Mantid (double TOF + 8-byte pulse-time). So:

EVENTS_PER_ITER = EVENT_WS_TARGET_BYTES // 16   # ~167 M events for 2.5 GB
BUFFER_LEN      = BUFFER_TARGET_BYTES // 8      # ~671 M float64 values for 5 GB

One-time setup (before the loop)

Allocate the circular buffer of TOF-like doubles, uniformly distributed in [0.0, TOF_MAX_US):
```
buffer = np.random.default_rng(seed=12345).uniform(0.0, TOF_MAX_US, size=BUFFER_LEN)
```
Print its size in GB after allocation.

Pick names for the two ADS workspaces used each iteration:

ev_ws_name   = "frag_event_ws"
hist_ws_name = "frag_hist_ws"

Initialize the empty logs dict:
```
logs: dict = {}
```
Probe SNAP geometry to choose NumBanks/BankPixelWidth for CreateSampleWorkspace:
- Load SNAP once via LoadEmptyInstrument(InstrumentName="SNAP", OutputWorkspace="__snap_probe") and read getNumberHistograms().
- SNAP has 18 banks of 256x256 pixels = 1,179,648 spectra. Use NumBanks=18, BankPixelWidth=256 so CreateSampleWorkspace produces a matching histogram count, allowing a clean LoadInstrument(InstrumentName="SNAP", RewriteSpectraMap=True) overlay.
- Compute events_per_pixel = max(1, EVENTS_PER_ITER // num_hist). Recompute the actual achieved event count and log it.
- Delete __snap_probe after probing.
Initialize the buffer cursor: offset = 0, and the RSS-print throttle: last_print = 0.0 (so iteration 0 always prints), and start_perf = time.perf_counter().
Start tracemalloc: tracemalloc.start().

Per-iteration loop body (`for n_test in range(N_TESTS):`)

Create the event workspace with SNAP geometry, filled with ~2.5 GB of events:

CreateSampleWorkspace(
    OutputWorkspace=ev_ws_name,
    WorkspaceType="Event",
    NumBanks=18,
    BankPixelWidth=256,
    NumEvents=events_per_pixel,
    XMin=0.0,
  ...

This pull request was created from Copilot chat.

Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/d27bdceb-3b87-4610-a5aa-f8021496019b Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>

ekapadi · 2026-04-29T23:04:09Z

@copilot When running the new script. During the very first cycle we see an RSS of ~29.5 GB, which triggers an immediate OOM-kill. Please double-check your size calculations in the script?

Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/5ad47093-9480-4030-8dd4-6b5d85c7e618 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>

Copilot · 2026-04-29T23:05:57Z

@copilot When running the new script. During the very first cycle we see an RSS of ~29.5 GB, which triggers an immediate OOM-kill. Please double-check your size calculations in the ...

Found it — the buffer and event-workspace sizes were correct (~5 GB and ~2.5 GB), but CreateSampleWorkspace(..., XMin=0.0, XMax=1000.0, BinWidth=1.0) produced a 1000-bin X-axis. After ConvertToMatrixWorkspace, the resulting Workspace2D was 1000 bins × 1,179,648 spectra × 24 B (X+Y+E) ≈ 28 GB — that's the ~29.5 GB you saw on iter 0.

The live-data PreserveEvents=False path produces a small (typically single-bin) histogram per chunk, so I set BinWidth=TOF_MAX_US to give a single-bin X-axis. The histogram is now ~28 MB and peak RSS stays at the intended ~5 GB. Fix in a83ef86.

ekapadi · 2026-04-29T23:27:06Z

@copilot It is working better, but the initial RSS is now about 12 GB (it should be more like 7.5 = 5 <event buffer> + 2.5 <event workspace>). Please change the circular buffer size to be 2.5 GB; AND then, see if you can find where the rest of the extra space is coming from? Also, let's turn Mantid's log-level to WARNING, so that Mantid doesn't spam the logs?

Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/7484ba84-1e68-46da-af89-8ec5526fdb29 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>

Copilot · 2026-04-29T23:29:18Z

@copilot It is working better, but the initial RSS is now about 12 GB (it should be more like 7.5 = 5 <event buffer> + 2.5 <event workspace>). Please change the circular buffer s...

The ~4.5 GB overshoot is std::vector<TofEvent> capacity slack inside each EventList. CreateSampleWorkspace populates each spectrum via push_back, so the vector capacity grows in powers of two. With raw events_per_pixel = 167M / 1.18M ≈ 142, the per-spectrum capacity rounds up to 256 — across 1,179,648 spectra that's an extra ~2 GB of slack memory (256 × 1,179,648 × 16 B ≈ 4.5 GB resident vs the intended 2.5 GB). Add buffer (5 GB) + Python/Mantid/tracemalloc overhead and you land near 12 GB.

Fix in bb6be72: round events_per_pixel down to the nearest power of two (142 → 128). Now capacity == size, so the event WS is exactly 128 × 1,179,648 × 16 B = 2.25 GB, and total expected RSS ≈ 7.25 GB (5 GB buffer + 2.25 GB event WS). Docstring updated to document the rounding rationale.

ekapadi · 2026-04-30T00:29:35Z

@copilot Let's start with the current value for events_per_pixel, and divide it by 3.0 -- I want it to definitely NOT be a power of two, but also to be smaller than it is currently. Also, please set the Mantid log-level to WARNING so we can actually find the RSS logs in the logs stream?

Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/9a78acae-4c9b-4f53-873a-fb62720ecf95 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>

Copilot · 2026-04-30T00:31:17Z

@copilot Let's start with the current value for events_per_pixel, and divide it by 3.0 -- I want it to definitely NOT be a power of two, but also to be smaller than it is currently....

Done in d6f69e8.

events_per_pixel now = previous pow2-rounded value / 3.0 → 128 / 3 = 42 (not a power of two, smaller than before). Event WS ≈ 0.74 GB; expected RSS ≈ 5 GB (buffer) + 0.74 GB (event WS) ≈ 5.74 GB.
Mantid framework log level set to warning via ConfigService.setString("logging.loggers.root.level", "warning") so the per-iteration RSS prints stand out in the log stream.

Initial plan

165fb95

Copilot AI assigned Copilot and ekapadi Apr 29, 2026

Copilot started work on behalf of ekapadi April 29, 2026 22:16 View session

Add tests/cis_tests/mantid_memory_fragmentation.py

64de0e7

Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/d27bdceb-3b87-4610-a5aa-f8021496019b Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add new CIS test script for Mantid memory fragmentation~~ Add mantid_memory_fragmentation.py CIS test for isolated Mantid memory fragmentation investigation Apr 29, 2026

Copilot finished work on behalf of ekapadi April 29, 2026 22:21

Copilot AI requested a review from ekapadi April 29, 2026 22:21

Fix OOM: use single-bin X-axis for event workspace

a83ef86

Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/5ad47093-9480-4030-8dd4-6b5d85c7e618 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>

Copilot started work on behalf of ekapadi April 29, 2026 23:05 View session

Copilot finished work on behalf of ekapadi April 29, 2026 23:06

Copilot started work on behalf of ekapadi April 29, 2026 23:27 View session

Round events_per_pixel down to power of 2 to match vector capacity

bb6be72

Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/7484ba84-1e68-46da-af89-8ec5526fdb29 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>

Copilot finished work on behalf of ekapadi April 29, 2026 23:31

Copilot started work on behalf of ekapadi April 30, 2026 00:29 View session

events_per_pixel = pow2/3.0 (=42); set Mantid log level to warning

d6f69e8

Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/9a78acae-4c9b-4f53-873a-fb62720ecf95 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>

Copilot finished work on behalf of ekapadi April 30, 2026 00:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add mantid_memory_fragmentation.py CIS test for isolated Mantid memory fragmentation investigation#5

Add mantid_memory_fragmentation.py CIS test for isolated Mantid memory fragmentation investigation#5
Copilot wants to merge 5 commits intomantid_memory_fragmentationfrom
copilot/add-mantid-memory-fragmentation-test

Copilot AI commented Apr 29, 2026 •

edited

Loading

Uh oh!

ekapadi commented Apr 29, 2026

Uh oh!

Copilot AI commented Apr 29, 2026

Uh oh!

ekapadi commented Apr 29, 2026

Uh oh!

Copilot AI commented Apr 29, 2026

Uh oh!

ekapadi commented Apr 30, 2026

Uh oh!

Copilot AI commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Goal

Background — what the live-data script does that we want to mimic

Required design

Imports / shared helpers

Tunable constants (module-level, near top)

One-time setup (before the loop)

Per-iteration loop body (for n_test in range(N_TESTS):)

Uh oh!

ekapadi commented Apr 29, 2026

Uh oh!

Copilot AI commented Apr 29, 2026

Uh oh!

ekapadi commented Apr 29, 2026

Uh oh!

Copilot AI commented Apr 29, 2026

Uh oh!

ekapadi commented Apr 30, 2026

Uh oh!

Copilot AI commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Apr 29, 2026 •

edited

Loading

Per-iteration loop body (`for n_test in range(N_TESTS):`)