Optimization #111
…any) cc3d statistics also computes centroids+bboxes that np_volume discards. For many-label arrays (e.g. connected-component maps) np.bincount is far cheaper: ~37x faster on a 64k-component map and ~3x on ~400 labels, while cc3d stays faster for the few-label anatomical-segmentation case. Switch on a cheap arr.max()>256 check to get best-of-both. Verified equal across dtypes / label counts / include_zero; speedtest_volume.py added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
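A minimal sketch of the many-label branch described above, using only NumPy. The helper name `volume_bincount` is hypothetical and is not the actual `np_volume` implementation; it assumes non-negative integer labels in a dtype that `np.bincount` accepts (for example uint8, uint16 or int64).

```python
import numpy as np


def volume_bincount(arr: np.ndarray, include_zero: bool = False) -> dict:
    """Count voxels per label with a single np.bincount pass.

    Hypothetical helper for illustration; assumes non-negative integer labels.
    """
    counts = np.bincount(arr.ravel())
    return {
        int(label): int(n)
        for label, n in enumerate(counts)
        if n > 0 and (include_zero or label != 0)
    }


seg = np.array([[0, 1, 1], [5, 5, 5]], dtype=np.uint16)
vol = volume_bincount(seg)
vol_with_background = volume_bincount(seg, include_zero=True)
```

The cost of `np.bincount` depends on the array size and the maximum label, not on how many distinct labels exist, which is consistent with it winning on connected-component maps.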
For 3D arrays, two of the three axis extents are derived from a single shared 2D projection (np.any over the contiguous last axis), so only one extra full reduction is needed. ~13-18% faster across 256^3/512^3 and px_dist values; identical slices (verified vs old impl on 2D+3D, incl. empty handling). Generic n-D path unchanged. speedtest_bbox_binary.py added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
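The shared-projection idea can be sketched as follows. This is an illustration under stated assumptions, not the `np_bbox_binary` code: the name `bbox_3d` is hypothetical, the mask is assumed to be non-empty, and the `px_dist` padding and empty handling mentioned above are left out.

```python
import numpy as np


def bbox_3d(mask: np.ndarray) -> tuple:
    """Tight bounding box of a non-empty 3D boolean mask (hypothetical helper)."""
    # One full reduction over the contiguous last axis gives a 2D projection.
    proj = np.any(mask, axis=2)
    # Axes 0 and 1 are then derived from that small 2D array.
    any_0 = np.any(proj, axis=1)
    any_1 = np.any(proj, axis=0)
    # The last axis needs the only other full reduction.
    any_2 = np.any(mask, axis=(0, 1))

    def extent(flags: np.ndarray) -> slice:
        idx = np.flatnonzero(flags)
        return slice(int(idx[0]), int(idx[-1]) + 1)

    return extent(any_0), extent(any_1), extent(any_2)


mask = np.zeros((6, 7, 8), dtype=bool)
mask[1:3, 2:5, 4:6] = True
box = bbox_3d(mask)
```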
…ding_boxes np_center_of_mass and np_bounding_boxes built a `unique` list then filtered with `idx in unique` (O(max_label x n_unique)). Check voxel_counts[idx] directly instead. ~2.2x faster at ~4k labels, ~1.3x at ~2k, unchanged for few labels; identical output (use_crop preserved). speedtest_center_of_mass.py added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
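The difference between the two filters can be illustrated in plain Python. Both function names are hypothetical, and `voxel_counts` stands in for the per-label count array returned by the statistics pass.

```python
def labels_via_list(voxel_counts):
    """Old pattern: build a list of present labels, then test `idx in unique`."""
    unique = [i for i, n in enumerate(voxel_counts) if n > 0]
    # Each `in` test scans the list, so the total cost grows with
    # max_label * n_unique.
    return [idx for idx in range(len(voxel_counts)) if idx in unique]


def labels_via_counts(voxel_counts):
    """New pattern: index the count array directly, one O(1) check per label."""
    return [idx for idx in range(len(voxel_counts)) if voxel_counts[idx] > 0]


counts = [10, 0, 3, 0, 7]
old_result = labels_via_list(counts)
new_result = labels_via_counts(counts)
```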
…+mask The per-label per-iteration 'data = out.copy(); data[i != data] = 0' was a full array copy plus a masked write. _binary_dilation casts its input to bool anyway, so 'data = out == i' is bit-exact (verified across 6480 configs) and skips the copy. ~11-18% faster across n_pixel/connectivity on few-label 150^3 segs. The n_pixel loop is kept (it encodes iterative inter-label competition that a single larger-kernel pass would not reproduce). speedtest_dilate_vectorized.py added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
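A small check of the equivalence claimed above, assuming the consumer only looks at the boolean cast of its input. The function names are hypothetical. The equivalence holds for non-zero labels only: for `i == 0` the copy-and-mask version yields an all-False mask, while `out == 0` marks the background.

```python
import numpy as np


def label_mask_copy(out: np.ndarray, i: int) -> np.ndarray:
    """Old pattern: full copy, masked write, then the bool cast a dilation applies."""
    data = out.copy()
    data[i != data] = 0
    return data.astype(bool)


def label_mask_eq(out: np.ndarray, i: int) -> np.ndarray:
    """New pattern: a single comparison, no intermediate copy."""
    return out == i


out = np.array([[0, 1, 2], [2, 1, 0]], dtype=np.uint8)
same_for_all_labels = all(
    np.array_equal(label_mask_copy(out, i), label_mask_eq(out, i)) for i in (1, 2)
)
```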
np.isin is 3-6x slower than a boolean lookup table for multi-label membership on uint segmentation masks. New np_isin() builds lut[labels]=True and gathers lut[arr] for unsigned arrays with a small label range; it special-cases the single-label (arr==label) case and falls back to np.isin for signed/negative/ huge-range inputs. Verified equal to np.isin across 132 dtype/label/invert cases. Applied at the 9 multi-label np.isin sites (extract_label, erode/dilate (+euclid), connected_components, filter_connected_components). numpy's own kind='table' did not help. speedtest_isin_lut.py added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
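A rough sketch of the lookup-table approach described above. It is not the `np_isin` code from this PR: the name `isin_lut` and the `max_range` threshold are assumptions made for illustration, and the single-label special case is omitted.

```python
import numpy as np


def isin_lut(arr, labels, invert=False, max_range=1 << 16):
    """Membership test via a boolean lookup table (hypothetical helper).

    Falls back to np.isin for signed, empty, negative or large-range inputs.
    """
    labels = np.asarray(labels).ravel()
    usable = (
        arr.dtype.kind == "u"
        and arr.size > 0
        and labels.size > 0
        and labels.dtype.kind in "ui"
        and int(labels.min()) >= 0
        and max(int(arr.max()), int(labels.max())) < max_range
    )
    if not usable:
        return np.isin(arr, labels, invert=invert)
    # lut[v] is True exactly when v is one of the requested labels.
    lut = np.zeros(max(int(arr.max()), int(labels.max())) + 1, dtype=bool)
    lut[labels.astype(np.intp)] = True
    mask = lut[arr]  # one gather over the whole array
    return ~mask if invert else mask


arr = np.array([[0, 1, 2], [3, 4, 2]], dtype=np.uint8)
hit = isin_lut(arr, [2, 4])
miss = isin_lut(arr, [2, 4], invert=True)
```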
The keep_label path called get_seg_array() twice (two full array copies) plus np_extract_label and a multiply. Now it takes a single copy and zeros voxels not in the label set via np_isin, keeping original label values. ~1.74x faster on a 300^3 mask; output identical (verified scalar+list, keep+binary paths). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The per-label 'seg_arr[seg_arr == l] = 0' loop costs one full pass per removed
label (linear in label count). A single np_map_labels gather ({label: fill})
takes one pass regardless of label count: tied with the loop for a few labels, ~2.2x faster at 20
labels (sparse) and ~6x on dense masks. Enums are now resolved to .value like
extract_label does (the int path is unchanged). Verified equal across
scalar/list/nested labels and removed_to_label values.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
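The loop-versus-gather comparison can be sketched with plain NumPy. Both functions are hypothetical stand-ins (this is not the `np_map_labels` implementation) and assume a non-empty array of non-negative integer labels.

```python
import numpy as np


def remove_labels_loop(seg, labels, fill=0):
    """Old pattern: one full comparison and masked write per removed label."""
    out = seg.copy()
    for label in labels:
        out[out == label] = fill
    return out


def remove_labels_lut(seg, labels, fill=0):
    """New pattern: build an identity table, overwrite the removed labels, gather once."""
    lut = np.arange(int(seg.max()) + 1, dtype=seg.dtype)
    present = np.array([l for l in labels if 0 <= l < lut.size], dtype=np.intp)
    lut[present] = fill
    return lut[seg]


seg = np.array([[0, 1, 2, 3], [3, 2, 1, 7]], dtype=np.uint8)
removed_loop = remove_labels_loop(seg, [1, 3, 9])
removed_lut = remove_labels_lut(seg, [1, 3, 9])
```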
… verbose The two np_unique full-array scans only feed the verbose log line. Guarding them behind 'verbose' makes the common in-loop verbose=False path ~5x faster on a 300^3 mask. The verbose=True output is unchanged; the returned data is identical either way. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
It called extract_label(...).get_seg_array() twice (each a full round-trip: copy + np_extract_label + NII construction + copy) just to get two binary masks. Both masks now come directly from the single get_array() via np_isin. ~2.15x faster on a 300^3 mask; output identical (verified across idx/not_beyond/axis/inclusion). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…mass pass The default _crop path looped over every label doing extract_label(i) + compute_crop + scipy center_of_mass. np_center_of_mass (cc3d) returns every label's centroid in a single pass. ~5x faster at 8-16 labels and ~9x at 20-40 labels; output bit-identical (verified 379/379 points exact to the rounded decimal). The non-_crop fallback is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
POI_Global construction (to_global) and to_other applied the affine transform one point at a time in a Python loop. Added vectorized local_to_global_arr (POI) and global_to_local_arr (Has_Grid) that transform an (N,3) array in a single matmul, and use them in those loops. ~7-8x faster (100-400 points); output bit-identical (verified vs per-point, with/without itk_coords). to_other keeps the per-point path when verbose=True to preserve its logging. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
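A minimal sketch of the batched transform, assuming a 4x4 homogeneous affine and an (N, 3) point array. The function names here are hypothetical and do not reproduce `local_to_global_arr` or `global_to_local_arr`; the example checks agreement with the per-point path using `np.allclose` rather than claiming bit-identity.

```python
import numpy as np


def transform_per_point(affine, points):
    """Old pattern: apply the 4x4 affine to one homogeneous point at a time."""
    return np.array([(affine @ np.append(p, 1.0))[:3] for p in points])


def transform_batch(affine, points):
    """New pattern: a single matmul over the whole (N, 3) array."""
    points = np.asarray(points, dtype=float)
    return points @ affine[:3, :3].T + affine[:3, 3]


affine = np.array(
    [
        [0.0, -2.0, 0.0, 10.0],
        [1.5, 0.0, 0.0, -4.0],
        [0.0, 0.0, 3.0, 2.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
)
pts = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
global_pts = transform_batch(affine, pts)
# The reverse direction uses the inverse affine in the same way.
local_again = transform_batch(np.linalg.inv(affine), global_pts)
```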
flatten-mode filtering did candidates.copy() then list.remove() per dropped file (each remove is O(n) -> O(n^2) overall). Replaced with a single list comprehension; ~48x faster filtering 2000 candidates. The dict-mode branches likewise drop the throwaway dict copy()+pop() for a dict comprehension. Output identical (verified across flatten/dict x keys for both filter methods). The comprehension also removes by identity, avoiding list.remove's first-equal removal quirk. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
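The two filtering patterns can be sketched in plain Python. The function names and the `keep` predicate are hypothetical; the real filters operate on BIDS file candidates.

```python
def filter_with_remove(candidates, keep):
    """Old pattern: copy the list, then remove each rejected item.

    list.remove scans from the front, so the total cost is O(n^2).
    """
    out = candidates.copy()
    for c in candidates:
        if not keep(c):
            out.remove(c)
    return out


def filter_with_comprehension(candidates, keep):
    """New pattern: one pass that builds the result directly, O(n)."""
    return [c for c in candidates if keep(c)]


def filter_dict(candidates, keep):
    """Dict-mode equivalent: a dict comprehension instead of copy() + pop()."""
    return {k: v for k, v in candidates.items() if keep(v)}


files = [f"sub-{i:03d}_ct.nii.gz" for i in range(6)]
is_even = lambda name: int(name[4:7]) % 2 == 0  # noqa: E731
kept_old = filter_with_remove(files, is_even)
kept_new = filter_with_comprehension(files, is_even)
kept_dict = filter_dict({i: f for i, f in enumerate(files)}, is_even)
```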
_get_mesh called from_segmentation_nii(extract_label(u)) for every label, and that reorients + rescales (resamples) the image each time. Reorient/rescale commute with extract_label for nearest-neighbour segmentation resampling, so the image is now transformed once before the loop. ~5x (12 labels) to ~7x (25 labels) faster on the transform; the per-label marching-cubes meshes are bit-identical (verified arrays and mesh vertices for rescale_to_iso True/False). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
This PR focuses on performance optimizations across segmentation/label-array utilities, POI coordinate transforms, mesh preview generation, and BIDS candidate filtering, and adds a set of speedtest scripts to benchmark the proposed improvements.
Changes:
- Introduces faster label/segmentation primitives (`np_isin` LUT path, `np_volume` heuristic, faster `np_center_of_mass`/`np_bounding_boxes`, 3D-specialized `np_bbox_binary`, and a reduced-copy `np_dilate_msk` inner loop).
- Vectorizes POI coordinate conversions and accelerates centroid computation by using a single cc3d statistics pass.
- Optimizes higher-level workflows (mesh preview label loop, NII label operations, BIDS filter loops) and adds multiple benchmarking scripts under `tests/speedtests/`.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| TPTBox/tests/speedtests/speedtest_volume.py | New benchmark for np_volume implementations across label-count regimes. |
| TPTBox/tests/speedtests/speedtest_poi_to_global.py | New benchmark for batched POI local→global conversion. |
| TPTBox/tests/speedtests/speedtest_poi_calc_centroids.py | New benchmark for centroid computation approaches. |
| TPTBox/tests/speedtests/speedtest_nii_truncate_masks.py | New benchmark for mask extraction optimization in truncation logic. |
| TPTBox/tests/speedtests/speedtest_nii_remove_labels.py | New benchmark for remove_labels implementations (loop vs map/isin). |
| TPTBox/tests/speedtests/speedtest_nii_map_labels.py | New benchmark for avoiding np_unique scans when verbose=False. |
| TPTBox/tests/speedtests/speedtest_nii_extract_label_keep.py | New benchmark for extract_label(..., keep_label=True) optimization. |
| TPTBox/tests/speedtests/speedtest_mesh_preview_hoist.py | New benchmark for hoisting reorient/rescale outside per-label mesh loop. |
| TPTBox/tests/speedtests/speedtest_isin_lut.py | New benchmark comparing np.isin modes vs explicit LUT. |
| TPTBox/tests/speedtests/speedtest_dilate_vectorized.py | New benchmark for reduced-copy dilation inner loop (out == i). |
| TPTBox/tests/speedtests/speedtest_center_of_mass.py | New benchmark for direct voxel-count filtering in cc3d stats postprocessing. |
| TPTBox/tests/speedtests/speedtest_bids_filter.py | New benchmark for O(n) list comprehension filter vs O(n²) remove loop. |
| TPTBox/tests/speedtests/speedtest_bbox_binary.py | New benchmark for 3D np_bbox_binary 2-pass specialization. |
| TPTBox/mesh3D/html_preview.py | Hoists reorient/rescale once for per-label mesh generation. |
| TPTBox/core/poi.py | Adds local_to_global_arr and speeds up calc_centroids (cc3d-based path). |
| TPTBox/core/poi_fun/poi_global.py | Uses batched affine/inverse-affine conversions when not verbose. |
| TPTBox/core/np_utils.py | Adds np_isin; updates multiple utilities to use it; optimizes volume/COM/bbox/dilate/bbox_binary. |
| TPTBox/core/nii_wrapper.py | Uses np_isin in truncation/extract-label; avoids verbose-only scans; speeds remove_labels via np_map_labels. |
| TPTBox/core/nii_poi_abstract.py | Adds global_to_local_arr vectorized conversion. |
| TPTBox/core/bids_files.py | Replaces copy+remove loops with comprehensions (flatten and dict modes). |
else:
arrc = arr
if labels is not None:
arrc = arrc.copy()
arrc[np.isin(arr_bin, labels, invert=True)] = 0
arrc[np_isin(arr_bin, labels, invert=True)] = 0
# np.bincount wins decisively when there are many labels (e.g. connected-component maps);
# cc3d statistics is faster for the few-label case typical of anatomical segmentations.
counts = np.bincount(arr.ravel()) if int(arr.max()) > 256 else cc3dstatistics(arr, use_crop=not include_zero)["voxel_counts"]
if include_zero:
return {idx: i for idx, i in dict(enumerate(cc3dstatistics(arr, use_crop=False)["voxel_counts"])).items() if i > 0}
else:
return {idx: i for idx, i in dict(enumerate(cc3dstatistics(arr)["voxel_counts"])).items() if i > 0 and idx != 0}
return {idx: i for idx, i in enumerate(counts) if i > 0}
return {idx: i for idx, i in enumerate(counts) if i > 0 and idx != 0}
ctd_list[first_stage, i] = out_coord
else:
ctd_list[int(i), second_stage] = out_coord
ctd_list[i, second_stage] = out_coord
Add int(x) back in. I do not want float/np.array values or strings in here; that would lead to errors.
eq = lambda x, y: x == y # noqa: E731
for n_labels in (100, 400):
How much time does this save? Usually, POI resampling is negligibly fast.
Fix the Copilot comment and my int comment. Rest LGTM.
No description provided.