Speed up nnUnet inference #110

Merged
Hendrik-code merged 22 commits into main from development_robert
Jun 10, 2026

Conversation

@robert-graf

Collaborator

Incorporated Claude's suggested settings improvement.

Speed up argmax by 10x if GPU is available.

Add a tqdm progress bar for argmax (large shapes only) so users do not think it has stalled.

robert-graf and others added 21 commits May 27, 2026 15:15
All sliding-window tiles share the same patch_size shape, so cuDNN can
autotune the fastest conv algorithms once and reuse them across tiles and
images. TF32 accelerates fp32 matmul/conv on Ampere+ GPUs with negligible
accuracy impact. Gated behind a new fast_perf flag (default True, CUDA only).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
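
The settings described in the commit above could look roughly like the sketch below. It assumes the standard PyTorch backend switches; `enable_fast_perf` is a hypothetical helper name rather than the function added in this PR, and the guarded import only exists so the snippet also runs where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def enable_fast_perf(fast_perf: bool = True) -> bool:
    """Hypothetical helper: enable cuDNN autotuning and TF32 when CUDA is usable.

    Returns True if the flags were set, False otherwise.
    """
    if not fast_perf or torch is None or not torch.cuda.is_available():
        return False
    # Fixed-shape tiles let cuDNN benchmark conv algorithms once and reuse the winner.
    torch.backends.cudnn.benchmark = True
    # Allow TF32 for fp32 matmuls and convolutions (effective on Ampere and newer).
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True
```

Autotuning pays off here because the input shape never changes between tiles; with varying shapes, `cudnn.benchmark` would re-tune repeatedly and could slow things down.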
…erence

inference_mode disables autograd view/version tracking entirely, so it is
slightly faster and uses less memory than no_grad. All tensors here are pure
inference outputs (moved to CPU / converted to numpy downstream), so the
stricter inference-tensor semantics are safe.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
predict_sliding_window_return_logits runs once per fold; calling
torch.cuda.empty_cache() at its start and end returned allocator blocks to
the driver right before the next fold reallocated them (slow cudaMalloc +
sync). The pool is still cleared once per image after fold averaging, and all
OOM-recovery empty_cache() calls are untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
initialize_from_trained_model_folder built loaded_networks by appending the
same self.network object once per fold, so every entry ended up holding the
LAST fold's weights. predict_logits_from_preprocessed_data then ran that one
weight set N times and averaged it, silently collapsing the N-fold ensemble to
a single fold while still paying Nx compute. Separately, cache_state_dicts=False
left loaded_networks as [] (not None), causing an IndexError in the predict
loop.

Now loaded_networks is None unless exactly one fold is present (preloaded for
zero per-prediction reloads). For >1 fold it stays None and the predict loop
swaps weights per fold via load_state_dict, restoring a correct ensemble using
a single network's worth of GPU memory. Device-availability warnings now fire
once regardless of fold count.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
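
The aliasing bug described in the commit above can be reproduced without any deep-learning framework. The sketch below uses a hypothetical `TinyNet` stand-in with a single scalar weight; it is not the nnUNet predictor code, only an illustration of why appending the same object per fold collapses the ensemble and how swapping weights per fold restores it.

```python
class TinyNet:
    """Stand-in for a network holding one weight value (hypothetical)."""

    def __init__(self):
        self.weight = 0.0

    def load_state_dict(self, state):
        self.weight = state["weight"]

    def __call__(self, x):
        return self.weight * x


fold_states = [{"weight": 1.0}, {"weight": 2.0}, {"weight": 3.0}]

# Buggy pattern: the same object is appended once per fold,
# so every list entry aliases the network with the LAST fold's weights.
network = TinyNet()
loaded_networks = []
for state in fold_states:
    network.load_state_dict(state)
    loaded_networks.append(network)
buggy = sum(net(1.0) for net in loaded_networks) / len(loaded_networks)

# Fixed pattern: keep the state dicts and swap weights into one network per fold.
network = TinyNet()
total = 0.0
for state in fold_states:
    network.load_state_dict(state)
    total += network(1.0)
fixed = total / len(fold_states)

print(buggy, fixed)  # 3.0 2.0
```

The buggy average equals the last fold's output alone, while the fixed loop averages three distinct folds and still holds only one network in memory.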
Every run_inference_on_file call previously reloaded the predictor from disk
and re-uploaded weights to the GPU. With cache_model=True the loaded predictor
is kept in a process-wide cache keyed by model path + folds + device/runtime
settings, so a loop over many files reuses the in-memory model. When caching,
the end-of-call del/empty_cache is skipped so the CUDA allocator pool stays
warm between images. Default is False to preserve current memory semantics; the
flag forwards through run_VibeSeg/run_vibeseg/run_nnunet via **kwargs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
_run_sub processed one tile per forward pass, leaving the GPU underutilised for
small patches. Tiles all share patch_size, so they batch densely: with
tile_batch_size>1 they are stacked into a (B, C, *patch) tensor and run in one
pass, then scattered back per tile. Mirroring/gaussian/accumulation are
unchanged. tile_batch_size=1 (default) takes the original view-based path
verbatim, so behaviour and memory are unchanged unless opted in. Threaded
through nnUNetPredictor, load_inf_model and run_inference_on_file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
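
The stack-and-scatter idea from the commit above can be illustrated with NumPy. This is a simplified sketch, not the `_run_sub` implementation: `fake_network` and `predict_tiles` are hypothetical, tiles do not overlap, and mirroring and gaussian weighting are left out.

```python
import numpy as np


def fake_network(batch):
    """Stand-in for the forward pass: (B, C, *patch) -> (B, C, *patch)."""
    return batch * 2.0


def predict_tiles(image, slicers, tile_batch_size=1):
    """Run tiles through the network in batches and accumulate the results."""
    out = np.zeros_like(image, dtype=np.float64)
    for start in range(0, len(slicers), tile_batch_size):
        chunk = slicers[start:start + tile_batch_size]
        # All tiles share the patch shape, so they stack into (B, C, *patch).
        batch = np.stack([image[(slice(None), *sl)] for sl in chunk])
        pred = fake_network(batch)
        # Scatter each prediction back to its tile location.
        for sl, tile_pred in zip(chunk, pred):
            out[(slice(None), *sl)] += tile_pred
    return out


image = np.arange(2 * 8 * 8, dtype=np.float64).reshape(2, 8, 8)  # (C, H, W)
patch = 4
slicers = [
    (slice(y, y + patch), slice(x, x + patch))
    for y in range(0, 8, patch)
    for x in range(0, 8, patch)
]
single = predict_tiles(image, slicers, tile_batch_size=1)
batched = predict_tiles(image, slicers, tile_batch_size=3)
```

Both calls produce the same output; the batched one simply issues fewer, larger forward passes, and the last batch may be smaller than `tile_batch_size`.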
Adds benchmarks/ with a phase-resolved timing tool for run_inference_on_file
(load / preprocess / sliding-window predict / postprocess), using CUDA
synchronisation, warmup, repeats and peak-memory tracking. It instruments the
pipeline by monkeypatching (no library changes) and signature-filters its kwargs
so the same file runs against every commit in the optimisation range.

- benchmark_nnunet_inference.py: run/compare subcommands; synthetic or real input.
- bench_across_commits.sh: replays one config across baseline+each commit and
  prints per-commit deltas (isolates the always-on changes; the fold fix shows up
  as the fold_status column flipping from DUPLICATED to lazy-per-fold/distinct).
- bench_flag_sweep.sh: sweeps the opt-in flags at HEAD (cache_model needs >1 call,
  tile_batch_size, max_folds, step_size).
- README.md: usage, which tool measures which commit, and caveats.

pyproject: exempt benchmarks/ from docstring rules, mirroring the test dirs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
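
Two techniques named in the commit above, synchronised timing with warmup and signature-filtered kwargs, can be sketched with the standard library alone. `time_phase` and `filter_kwargs` are hypothetical names and do not reflect the actual contents of benchmarks/.

```python
import inspect
import time


def time_phase(fn, warmup=1, repeats=3, sync=None):
    """Time fn() after warmup runs; returns a list of seconds, one per repeat."""

    def run_once():
        if sync is not None:
            sync()  # drain queued GPU work before starting the clock
        t0 = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn before stopping the clock
        return time.perf_counter() - t0

    for _ in range(warmup):
        run_once()  # warmup results are discarded
    return [run_once() for _ in range(repeats)]


def filter_kwargs(fn, kwargs):
    """Keep only the keyword arguments that fn's signature accepts."""
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # fn takes **kwargs, so everything passes through
    return {k: v for k, v in kwargs.items() if k in params}
```

On a CUDA machine one would pass `sync=torch.cuda.synchronize`, since GPU kernels launch asynchronously and unsynchronised timings mostly measure launch overhead. Filtering kwargs lets one benchmark script call older commits whose functions do not yet accept the newer flags.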
@robert-graf robert-graf requested a review from Hendrik-code June 10, 2026 09:41
Comment thread TPTBox/core/np_utils.py Outdated
Comment on lines +204 to +205
return list(np.unique(arr))

Owner

replace with old_np_unique() call

@Hendrik-code Hendrik-code merged commit 267153e into main Jun 10, 2026
@Hendrik-code Hendrik-code deleted the development_robert branch June 10, 2026 15:24