Speed up nnUnet inference #110
Merged
…nto development_robert
…nto development_robert
All sliding-window tiles share the same patch_size shape, so cuDNN can autotune the fastest conv algorithms once and reuse them across tiles and images. TF32 accelerates fp32 matmul/conv on Ampere+ GPUs with negligible accuracy impact. Gated behind a new fast_perf flag (default True, CUDA only). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
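A minimal sketch of what such a flag might toggle, assuming the standard PyTorch backend switches. The helper name `enable_fast_perf` is hypothetical and is not taken from the PR; only the `torch.backends` attributes are real PyTorch settings.

```python
import torch


def enable_fast_perf(fast_perf: bool = True) -> bool:
    """Enable cuDNN autotuning and TF32 on CUDA (hypothetical helper, not the PR's code)."""
    if not fast_perf or not torch.cuda.is_available():
        return False  # CPU or opted out: leave the global backend flags untouched
    # Every tile has the same patch_size, so cuDNN benchmarks each conv shape once
    # and reuses the chosen algorithm for later tiles and images.
    torch.backends.cudnn.benchmark = True
    # Allow TF32 for fp32 matmul and convolution on Ampere or newer GPUs.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    return True
```

These are process-wide settings, so a caller that also trains in the same process may want to restore the previous values afterwards.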
…erence inference_mode disables autograd view/version tracking entirely, so it is slightly faster and uses less memory than no_grad. All tensors here are pure inference outputs (moved to CPU / converted to numpy downstream), so the stricter inference-tensor semantics are safe. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
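As a rough illustration of the semantics described above, the following sketch runs a stand-in network under `torch.inference_mode()`. The small `Conv3d` and the zero-filled patch are assumptions for demonstration, not the PR's network or data.

```python
import torch

# Stand-in for the real segmentation network (assumption for illustration).
net = torch.nn.Conv3d(1, 2, kernel_size=3, padding=1).eval()
patch = torch.zeros(1, 1, 8, 8, 8)

with torch.inference_mode():
    logits = net(patch)

# The result is an inference tensor: it has no autograd tracking and cannot be
# used in autograd later, which is fine when it only feeds argmax and export.
is_inference = logits.is_inference()
segmentation = logits.argmax(dim=1)
```

Code that later needs gradients through such outputs would have to stay on `torch.no_grad()` instead.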
predict_sliding_window_return_logits runs once per fold; calling torch.cuda.empty_cache() at its start and end returned allocator blocks to the driver right before the next fold reallocated them (slow cudaMalloc + sync). The pool is still cleared once per image after fold averaging, and all OOM-recovery empty_cache() calls are untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
initialize_from_trained_model_folder built loaded_networks by appending the same self.network object once per fold, so every entry ended up holding the LAST fold's weights. predict_logits_from_preprocessed_data then ran that one weight set N times and averaged it, silently collapsing the N-fold ensemble to a single fold while still paying Nx compute. Separately, cache_state_dicts=False left loaded_networks as [] (not None), causing an IndexError in the predict loop. Now loaded_networks is None unless exactly one fold is present (preloaded for zero per-prediction reloads). For >1 fold it stays None and the predict loop swaps weights per fold via load_state_dict, restoring a correct ensemble using a single network's worth of GPU memory. Device-availability warnings now fire once regardless of fold count. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
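The aliasing problem described above can be reproduced without PyTorch. In this sketch, `TinyNet` is a hypothetical stand-in whose single weight is overwritten in place by `load_state_dict`, which is enough to show why appending the same object per fold collapses the ensemble.

```python
class TinyNet:
    """Stand-in for a network with one weight, loaded in place (hypothetical)."""

    def __init__(self):
        self.weight = 0.0

    def load_state_dict(self, state):
        self.weight = state["weight"]

    def __call__(self, x):
        return self.weight * x


fold_states = [{"weight": 1.0}, {"weight": 2.0}, {"weight": 3.0}]

# Buggy pattern: the same object is appended once per fold, so every list entry
# aliases it and ends up holding the last fold's weights.
net = TinyNet()
loaded_networks = []
for state in fold_states:
    net.load_state_dict(state)
    loaded_networks.append(net)
buggy_mean = sum(n(1.0) for n in loaded_networks) / len(loaded_networks)

# Fixed pattern: keep one network and swap weights per fold inside the predict loop.
net = TinyNet()
outputs = []
for state in fold_states:
    net.load_state_dict(state)
    outputs.append(net(1.0))
fixed_mean = sum(outputs) / len(outputs)
```

Here `buggy_mean` only reflects the last fold, while `fixed_mean` averages all three, with one network object alive at a time in both cases.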
Every run_inference_on_file call previously reloaded the predictor from disk and re-uploaded weights to the GPU. With cache_model=True the loaded predictor is kept in a process-wide cache keyed by model path + folds + device/runtime settings, so a loop over many files reuses the in-memory model. When caching, the end-of-call del/empty_cache is skipped so the CUDA allocator pool stays warm between images. Default is False to preserve current memory semantics; the flag forwards through run_VibeSeg/run_vibeseg/run_nnunet via **kwargs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
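A process-wide cache of this kind could look roughly like the sketch below. The names `get_predictor`, `_PREDICTOR_CACHE` and `fake_loader` are hypothetical, and the key here covers only path, folds and device, whereas the PR's key also includes runtime settings.

```python
from typing import Dict, Hashable, Tuple

# Process-wide cache (hypothetical name); lives as long as the interpreter.
_PREDICTOR_CACHE: Dict[Tuple[Hashable, ...], object] = {}


def get_predictor(model_path, folds, device, loader, cache_model=False):
    """Return a predictor, reusing a cached one when cache_model is True."""
    if not cache_model:
        return loader(model_path, folds, device)  # original behaviour: reload every call
    key = (str(model_path), tuple(folds), str(device))
    if key not in _PREDICTOR_CACHE:
        _PREDICTOR_CACHE[key] = loader(model_path, folds, device)
    return _PREDICTOR_CACHE[key]


load_calls = []


def fake_loader(model_path, folds, device):
    """Stand-in for loading from disk and uploading weights (assumption)."""
    load_calls.append(model_path)
    return object()


first = get_predictor("model_dir", (0, 1), "cuda:0", fake_loader, cache_model=True)
second = get_predictor("model_dir", (0, 1), "cuda:0", fake_loader, cache_model=True)
```

The second call returns the same object without invoking the loader again, which is the saving a loop over many files would see.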
_run_sub processed one tile per forward pass, leaving the GPU underutilised for small patches. Tiles all share patch_size, so they batch densely: with tile_batch_size>1 they are stacked into a (B, C, *patch) tensor and run in one pass, then scattered back per tile. Mirroring/gaussian/accumulation are unchanged. tile_batch_size=1 (default) takes the original view-based path verbatim, so behaviour and memory are unchanged unless opted in. Threaded through nnUNetPredictor, load_inf_model and run_inference_on_file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
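The stack-then-scatter idea can be sketched with NumPy. `fake_network` and `predict_tiles` are hypothetical stand-ins; Gaussian weighting and mirroring are left out, and overlapping tiles are simply averaged.

```python
import numpy as np


def fake_network(batch):
    """Stand-in for the network: maps (B, C, *patch) to (B, C, *patch)."""
    return batch * 2.0


def predict_tiles(image, slicers, tile_batch_size=1):
    """Average per-tile predictions; image is (C, H, W), slicers index (H, W)."""
    out = np.zeros(image.shape, dtype=np.float64)
    counts = np.zeros(image.shape[1:], dtype=np.float64)
    for start in range(0, len(slicers), tile_batch_size):
        chunk = slicers[start:start + tile_batch_size]
        # All tiles share the patch shape, so they stack into (B, C, *patch).
        batch = np.stack([image[(slice(None), *s)] for s in chunk])
        preds = fake_network(batch)  # one forward pass for the whole chunk
        for s, pred in zip(chunk, preds):  # scatter each result back to its tile
            out[(slice(None), *s)] += pred
            counts[s] += 1
    return out / counts  # assumes the tiles cover every pixel


image = np.arange(24, dtype=np.float64).reshape(1, 4, 6)
slicers = [(slice(0, 4), slice(0, 4)), (slice(0, 4), slice(2, 6))]  # two overlapping 4x4 tiles
one_by_one = predict_tiles(image, slicers, tile_batch_size=1)
batched = predict_tiles(image, slicers, tile_batch_size=2)
```

Both settings produce the same accumulated result; the batch size only changes how many tiles go through each forward pass.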
Adds benchmarks/ with a phase-resolved timing tool for run_inference_on_file (load / preprocess / sliding-window predict / postprocess), using CUDA synchronisation, warmup, repeats and peak-memory tracking. It instruments the pipeline by monkeypatching (no library changes) and signature-filters its kwargs so the same file runs against every commit in the optimisation range.
- benchmark_nnunet_inference.py: run/compare subcommands; synthetic or real input.
- bench_across_commits.sh: replays one config across baseline+each commit and prints per-commit deltas (isolates the always-on changes; the fold fix shows up as the fold_status column flipping from DUPLICATED to lazy-per-fold/distinct).
- bench_flag_sweep.sh: sweeps the opt-in flags at HEAD (cache_model needs >1 call, tile_batch_size, max_folds, step_size).
- README.md: usage, which tool measures which commit, and caveats.
pyproject: exempt benchmarks/ from docstring rules, mirroring the test dirs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
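Signature filtering of kwargs, as mentioned in the benchmark description, might be done along these lines. `filter_kwargs`, `timed`, `old_api` and `new_api` are hypothetical names for illustration; on a GPU the timer would also need a CUDA synchronisation before each clock read, which this CPU-only sketch omits.

```python
import inspect
import time


def filter_kwargs(func, kwargs):
    """Keep only the keyword arguments that func's signature accepts."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # func takes **kwargs, so nothing needs dropping
    return {k: v for k, v in kwargs.items() if k in params}


def timed(func, **kwargs):
    """Call func with the kwargs it supports and return (result, seconds)."""
    start = time.perf_counter()
    result = func(**filter_kwargs(func, kwargs))
    return result, time.perf_counter() - start


def old_api(path):
    """Stand-in for an older commit that lacks the new flag."""
    return f"old:{path}"


def new_api(path, tile_batch_size=1):
    """Stand-in for a newer commit that accepts the new flag."""
    return f"new:{path}:{tile_batch_size}"


config = {"path": "scan.nii.gz", "tile_batch_size": 4}
old_result, old_seconds = timed(old_api, **config)
new_result, new_seconds = timed(new_api, **config)
```

The same `config` runs against both functions; the older one silently ignores the flag it does not know, which is what lets one benchmark file replay across commits.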
…nto development_robert
Comment on lines +204 to +205
    return list(np.unique(arr))
Owner
Replace with an old_np_unique() call.
Incorporated Claude's suggested settings improvement.
Speed up argmax by 10x if GPU is available.
Add a tqdm progress bar for argmax (large shapes only) so users do not think it is stuck.