Speed up nnUnet inference by robert-graf · Pull Request #110 · Hendrik-code/TPTBox

robert-graf · 2026-06-10T08:58:26Z

Incorporated Claude's suggested settings improvement.

Speed up argmax by 10x if GPU is available.

Add a tqdm progress bar for argmax (large shapes only) so users do not think it got stuck.
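The description does not show how the argmax speed-up is implemented. As a rough sketch only, assuming the logits are a NumPy array of shape (classes, *spatial) with fewer than 256 classes, a chunked argmax that uses the GPU when one is present could look like the following. `chunked_argmax` and `chunk` are illustrative names, not TPTBox API.

```python
import numpy as np
import torch


def chunked_argmax(logits: np.ndarray, chunk: int = 64) -> np.ndarray:
    """Argmax over the class axis (axis 0), processed slab by slab.

    Hypothetical helper: working on slabs along the first spatial axis keeps
    peak device memory bounded, and the loop is a natural place to attach a
    progress bar for large volumes.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # uint8 output assumes fewer than 256 classes (an assumption of this sketch).
    out = np.empty(logits.shape[1:], dtype=np.uint8)
    for start in range(0, logits.shape[1], chunk):
        stop = start + chunk
        block = torch.from_numpy(logits[:, start:stop]).to(device)
        out[start:stop] = block.argmax(dim=0).to(torch.uint8).cpu().numpy()
    return out
```

On a machine without CUDA the same code runs on the CPU, so any speed-up depends on a GPU being available, as the description notes.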


All sliding-window tiles share the same patch_size shape, so cuDNN can autotune the fastest conv algorithms once and reuse them across tiles and images. TF32 accelerates fp32 matmul/conv on Ampere+ GPUs with negligible accuracy impact. Gated behind a new fast_perf flag (default True, CUDA only).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
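A minimal sketch of what such a gate could look like, using the standard PyTorch backend switches. The function name `enable_fast_perf` is an assumption for illustration; only the `fast_perf` flag name comes from the commit message.

```python
import torch


def enable_fast_perf(fast_perf: bool = True) -> bool:
    """Turn on cuDNN autotuning and TF32 when requested and CUDA is present.

    Returns True if the settings were applied. Hypothetical helper, not the
    actual TPTBox code.
    """
    if not (fast_perf and torch.cuda.is_available()):
        return False
    # Fixed tile shapes let cuDNN benchmark conv algorithms once and reuse them.
    torch.backends.cudnn.benchmark = True
    # Allow TF32 for fp32 matmuls and convolutions (effective on Ampere or newer).
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True
```

These are process-wide settings, so a caller that needs bit-exact fp32 results elsewhere would have to pass `fast_perf=False`.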

…erence

inference_mode disables autograd view/version tracking entirely, so it is slightly faster and uses less memory than no_grad. All tensors here are pure inference outputs (moved to CPU / converted to numpy downstream), so the stricter inference-tensor semantics are safe.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
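To illustrate the difference, here is a small sketch with a stand-in network (a single `Conv3d`, chosen only for the example). Both context managers give the same values, but only `torch.inference_mode()` produces inference tensors.

```python
import torch
from torch import nn

# Stand-in for the segmentation network; the real model is not shown here.
network = nn.Conv3d(1, 2, kernel_size=3, padding=1).eval()
tile = torch.randn(1, 1, 8, 8, 8)

with torch.no_grad():
    out_no_grad = network(tile)  # regular tensor, autograd recording off

with torch.inference_mode():
    out_inference = network(tile)  # inference tensor, no view/version tracking

# Downstream use as described above: move to CPU and convert to NumPy.
logits_np = out_inference.cpu().numpy()
```

The stricter semantics matter only if an output is later modified in place outside inference mode or fed into autograd, which the commit message says does not happen here.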

predict_sliding_window_return_logits runs once per fold; calling torch.cuda.empty_cache() at its start and end returned allocator blocks to the driver right before the next fold reallocated them (slow cudaMalloc + sync). The pool is still cleared once per image after fold averaging, and all OOM-recovery empty_cache() calls are untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

initialize_from_trained_model_folder built loaded_networks by appending the same self.network object once per fold, so every entry ended up holding the LAST fold's weights. predict_logits_from_preprocessed_data then ran that one weight set N times and averaged it, silently collapsing the N-fold ensemble to a single fold while still paying Nx compute.

Separately, cache_state_dicts=False left loaded_networks as [] (not None), causing an IndexError in the predict loop.

Now loaded_networks is None unless exactly one fold is present (preloaded for zero per-prediction reloads). For >1 fold it stays None and the predict loop swaps weights per fold via load_state_dict, restoring a correct ensemble using a single network's worth of GPU memory. Device-availability warnings now fire once regardless of fold count.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
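The aliasing problem and the per-fold weight swap can be reproduced in a few lines. This is a toy sketch with a one-weight `nn.Linear`; the function names `predict_aliased` and `predict_per_fold` are illustrative and not taken from the library.

```python
import torch
from torch import nn


def predict_aliased(network, fold_state_dicts, x):
    """Buggy pattern: the list holds the same object once per fold."""
    loaded_networks = []
    for state_dict in fold_state_dicts:
        network.load_state_dict(state_dict)
        loaded_networks.append(network)  # every entry now has the last fold's weights
    with torch.inference_mode():
        return sum(net(x) for net in loaded_networks) / len(loaded_networks)


def predict_per_fold(network, fold_state_dicts, x):
    """Fixed pattern: swap weights into one network right before each forward pass."""
    total = 0
    for state_dict in fold_state_dicts:
        network.load_state_dict(state_dict)
        with torch.inference_mode():
            total = total + network(x)
    return total / len(fold_state_dicts)


# Two "folds" whose single weight is 1.0 and 3.0 respectively.
folds = [{"weight": torch.tensor([[1.0]])}, {"weight": torch.tensor([[3.0]])}]
x = torch.ones(1, 1)
aliased = predict_aliased(nn.Linear(1, 1, bias=False), folds, x)
ensemble = predict_per_fold(nn.Linear(1, 1, bias=False), folds, x)
```

With these toy weights the aliased version returns the last fold's output (3.0) while the per-fold version returns the true mean (2.0), at the memory cost of a single network.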

Every run_inference_on_file call previously reloaded the predictor from disk and re-uploaded weights to the GPU. With cache_model=True the loaded predictor is kept in a process-wide cache keyed by model path + folds + device/runtime settings, so a loop over many files reuses the in-memory model. When caching, the end-of-call del/empty_cache is skipped so the CUDA allocator pool stays warm between images. Default is False to preserve current memory semantics; the flag forwards through run_VibeSeg/run_vibeseg/run_nnunet via **kwargs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
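A process-wide cache of this kind can be sketched with a module-level dictionary. Everything below (`_MODEL_CACHE`, `get_predictor`, the `loader` callback and the exact key fields) is a hypothetical illustration of the idea, not the PR's implementation.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

# Module-level, so it lives for the whole process (hypothetical name).
_MODEL_CACHE: Dict[Tuple[str, Tuple[int, ...], str], Any] = {}


def get_predictor(
    model_path: str,
    folds: Sequence[int],
    device: str,
    loader: Callable[[str, Sequence[int], str], Any],
    cache_model: bool = False,
) -> Any:
    """Return a predictor, reusing a cached one when cache_model is True."""
    if not cache_model:
        # Default behaviour: load fresh on every call, nothing is retained.
        return loader(model_path, folds, device)
    # The key must cover everything that changes the loaded object.
    key = (str(model_path), tuple(folds), str(device))
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = loader(model_path, folds, device)
    return _MODEL_CACHE[key]
```

The trade-off is the one the commit message names: cached predictors keep their memory until the process exits or the cache is cleared, which is why the flag is opt-in.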

_run_sub processed one tile per forward pass, leaving the GPU underutilised for small patches. Tiles all share patch_size, so they batch densely: with tile_batch_size>1 they are stacked into a (B, C, *patch) tensor and run in one pass, then scattered back per tile. Mirroring/gaussian/accumulation are unchanged. tile_batch_size=1 (default) takes the original view-based path verbatim, so behaviour and memory are unchanged unless opted in. Threaded through nnUNetPredictor, load_inf_model and run_inference_on_file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
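The stack-then-scatter step might look roughly like this. It is a simplified sketch: it averages overlapping tiles uniformly and leaves out the Gaussian weighting and mirroring that the real predictor applies, and `predict_tiles_batched` and `slicers` are assumed names.

```python
import torch
from torch import nn


def predict_tiles_batched(network, image, slicers, tile_batch_size=4):
    """Sliding-window prediction with several tiles per forward pass.

    image:   tensor of shape (C, *spatial)
    slicers: list of tuples of slices, one tuple per tile, all with the same
             patch shape and together covering every voxel.
    """
    with torch.inference_mode():
        logits = None
        counts = torch.zeros(image.shape[1:])
        for start in range(0, len(slicers), tile_batch_size):
            group = slicers[start:start + tile_batch_size]
            # Equal patch shapes make the tiles stackable: (B, C, *patch).
            batch = torch.stack([image[(slice(None), *sl)] for sl in group])
            out = network(batch)  # (B, K, *patch)
            if logits is None:
                logits = torch.zeros((out.shape[1], *image.shape[1:]))
            # Scatter each tile's prediction back to its location.
            for tile_out, sl in zip(out, group):
                logits[(slice(None), *sl)] += tile_out
                counts[sl] += 1
        return logits / counts
```

Larger `tile_batch_size` values raise peak memory roughly in proportion to the batch, so the best setting depends on patch size and available GPU memory.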

Adds benchmarks/ with a phase-resolved timing tool for run_inference_on_file (load / preprocess / sliding-window predict / postprocess), using CUDA synchronisation, warmup, repeats and peak-memory tracking. It instruments the pipeline by monkeypatching (no library changes) and signature-filters its kwargs so the same file runs against every commit in the optimisation range.

- benchmark_nnunet_inference.py: run/compare subcommands; synthetic or real input.
- bench_across_commits.sh: replays one config across baseline+each commit and prints per-commit deltas (isolates the always-on changes; the fold fix shows up as the fold_status column flipping from DUPLICATED to lazy-per-fold/distinct).
- bench_flag_sweep.sh: sweeps the opt-in flags at HEAD (cache_model needs >1 call, tile_batch_size, max_folds, step_size).
- README.md: usage, which tool measures which commit, and caveats.

pyproject: exempt benchmarks/ from docstring rules, mirroring the test dirs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
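Two of the techniques named above, signature-filtering keyword arguments and timing a phase by wrapping a function, can be sketched with the standard library alone. The names `filter_kwargs`, `timed` and `PHASE_SECONDS` are illustrative assumptions; the actual benchmark script is not reproduced here.

```python
import inspect
import time
from collections import defaultdict

# Accumulated wall-clock seconds per phase (hypothetical global for the sketch).
PHASE_SECONDS = defaultdict(float)


def filter_kwargs(func, kwargs):
    """Keep only the kwargs that func accepts, so one config works on old commits."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # func takes **kwargs, so everything is accepted
    return {k: v for k, v in kwargs.items() if k in params}


def timed(phase, func):
    """Wrap func so each call adds its duration to PHASE_SECONDS[phase].

    For GPU work, a real tool would also synchronise the device before reading
    the clock; that step is omitted in this CPU-only sketch.
    """
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            PHASE_SECONDS[phase] += time.perf_counter() - start
    return wrapper
```

Monkeypatching then amounts to assigning `timed("predict", original)` over the attribute that holds `original`, which leaves the library source untouched.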


Hendrik-code · 2026-06-10T14:07:50Z

+    return list(np.unique(arr))
+


replace with old_np_unique() call

robert-graf and others added 21 commits May 27, 2026 15:15

fail if not fit on GPU, so I can test it

88e7bf5

Merge branch 'main' into development_robert

edc368f

update memory requirements

4a6ee90

fix bug for very elongated segmentations

48a9d2a

Merge branch 'development_robert' of github.com:Hendrik-code/TPTBox into development_robert

9211f4f

should not use Runtime Errors

41fa833

Merge branch 'development_robert' of github.com:Hendrik-code/TPTBox into development_robert

a5e621c

bug fixes

365ab8b

Merge branch 'development_robert' of github.com:Hendrik-code/TPTBox into development_robert

99170c8

Merge branch 'nnunetinference' into development_robert

69e872e

fix bug made by claude

a7efb70

speed up argmax. Yes this much code is needed for this.

7b49086

ruff

afcb402

update tests. add ravel

d1bfde2

robert-graf requested a review from Hendrik-code June 10, 2026 09:41

Hendrik-code reviewed Jun 10, 2026


minor fallback plus updated speedtest

a4d1e6d

Hendrik-code merged commit 267153e into main Jun 10, 2026

Hendrik-code deleted the development_robert branch June 10, 2026 15:24

Hendrik-code merged 22 commits into main from development_robert


2 participants
