Add mantid_memory_fragmentation.py CIS test for isolated Mantid memory fragmentation investigation#5
Conversation
Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/d27bdceb-3b87-4610-a5aa-f8021496019b Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>
|
@copilot When running the new script. During the very first cycle we see an RSS of ~29.5 GB, which triggers an immediate OOM-kill. Please double-check your size calculations in the script? |
Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/5ad47093-9480-4030-8dd4-6b5d85c7e618 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>
Found it — the buffer and event-workspace sizes were correct (~5 GB and ~2.5 GB), but The live-data |
|
@copilot It is working better, but the initial RSS is now about 12 GB (it should be more like 7.5 = 5 <event buffer> + 2.5 <event workspace>). Please change the circular buffer size to be 2.5 GB; AND then, see if you can find where the rest of the extra space is coming from? Also, let's turn Mantid's log-level to WARNING, so that Mantid doesn't spam the logs? |
Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/7484ba84-1e68-46da-af89-8ec5526fdb29 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>
The ~4.5 GB overshoot is Fix in bb6be72: round |
|
@copilot Let's start with the current value for |
Agent-Logs-Url: https://github.com/ekapadi/SNAPRed/sessions/9a78acae-4c9b-4f53-873a-fb62720ecf95 Co-authored-by: ekapadi <5505178+ekapadi@users.noreply.github.com>
Done in d6f69e8.
|
tests/cis_tests/mantid_memory_fragmentation.pymatching reference structuredatetime.now(timezone.utc), explicit dual-cursor naming)CreateSampleWorkspacewas usingBinWidth=1.0over[0, 1000]us → 1000 bins × 1,179,648 spectra × 24 B (X+Y+E) ≈ 28 GB Workspace2D afterConvertToMatrixWorkspace. SetBinWidth=TOF_MAX_US(single-bin X-axis), matching the typical live-data chunk shape. Resulting histogram is now ~28 MB.events_per_pixelso that it is explicitly not a power of two (so the underlyingstd::vector<TofEvent>capacity is strictly larger than its size — the case we want to exercise for the fragmentation investigation) and smaller than the previous value: starting from the previous power-of-two-rounded value (128) and dividing by 3.0 yieldsevents_per_pixel = 42. The event WS is now ~0.74 GB; expected RSS ≈ 5 GB (buffer) + 0.74 GB (event WS) ≈ ~5.74 GB.warningviaConfigService.setString("logging.loggers.root.level", "warning")so the per-iteration RSS prints emitted by this script are easy to find in the otherwise verbose Mantid log stream.events_per_pixelrationale, and the log-level settingOriginal prompt
Goal
Add a new CIS test script at
tests/cis_tests/mantid_memory_fragmentation.pyon themantid_memory_fragmentationbranch. The script exercises the same Mantid code paths that the existingtests/cis_tests/live_data_memory_leak_v2.pyexercises after it has obtained a live-data chunk, but without invoking any live-data listener /LoadLiveData/hasLiveDataConnection/readLiveMetadatamachinery. It is meant to be runnable on any developer workstation (nobl3-daq1connection required) so we can investigate Mantid memory fragmentation in isolation.The reference script we are paralleling is at:
https://github.com/ekapadi/SNAPRed/blob/8478e3d1718b05f85819476f8367b75dae0148fe/tests/cis_tests/live_data_memory_leak_v2.py
Background — what the live-data script does that we want to mimic
In
LoadLiveData::exec(Framework/LiveData/src/LoadLiveData.cpp, around lines 488–535) whenPreserveEvents=Falseand the chunk is anEventWorkspace, Mantid creates aWorkspace2Dby invoking theConvertToMatrixWorkspacechild algorithm and replaces the output workspace with the resulting histogram workspace. We want our test to callConvertToMatrixWorkspaceso we exercise the same code path.The
readLiveMetadata()flow in the reference script then reads theRunof the resulting workspace to extract logs. We mimic that by transferring all log properties from the convertedWorkspace2Dinto a Pythondict.Required design
Imports / shared helpers
live_data_memory_leak_v2.py:numpy as np,os,time,logging,tracemalloc,pathlib,datetime,pydantic(only if useful),mantid.simpleapi *,mantid.kernel.ConfigService.get_pid_rss_kb(pid)helper verbatim from the reference script.snapred/IPTS_override/not_a_test/pauseimports and theIPTS_override()with-block wrapper, so the script structurally matches the reference. TheConfig["liveData.*"]lookups,hasLiveDataConnection(),_readLiveData,_liveMetadataFromRun,readLiveMetadatafunctions must NOT appear (this script is explicitly non-live-data).logging.basicConfig(level=logging.DEBUG), logger name"mantid_memory_fragmentation").ConfigService.setFacility(Config["liveData.facility.name"])— it is harmless and matches the reference.Tunable constants (module-level, near top)
TofEventis 16 bytes in Mantid (double TOF + 8-byte pulse-time). So:One-time setup (before the loop)
Allocate the circular buffer of TOF-like doubles, uniformly distributed in
[0.0, TOF_MAX_US):Print its size in GB after allocation.
Pick names for the two ADS workspaces used each iteration:
Initialize the empty logs dict:
Probe SNAP geometry to choose
NumBanks/BankPixelWidthforCreateSampleWorkspace:LoadEmptyInstrument(InstrumentName="SNAP", OutputWorkspace="__snap_probe")and readgetNumberHistograms().NumBanks=18,BankPixelWidth=256soCreateSampleWorkspaceproduces a matching histogram count, allowing a cleanLoadInstrument(InstrumentName="SNAP", RewriteSpectraMap=True)overlay.events_per_pixel = max(1, EVENTS_PER_ITER // num_hist). Recompute the actual achieved event count and log it.__snap_probeafter probing.Initialize the buffer cursor:
offset = 0, and the RSS-print throttle:last_print = 0.0(so iteration 0 always prints), andstart_perf = time.perf_counter().Start tracemalloc:
tracemalloc.start().Per-iteration loop body (
for n_test in range(N_TESTS):)This pull request was created from Copilot chat.