This project trains on image crops rather than full microscopy frames.
The crops are generated by the mpidb_prep utility:
cargo run --bin mpidb_prep -- <data_root> <out_root> <crop_size> <min_mask_area>
Defaults:
- data_root: data
- out_root: mpidb_crops
- crop_size: 128
- min_mask_area: 25
The output is:
- A directory of saved crop images (.png)
- A manifest.csv describing labels for each crop
For each malaria species directory under data/ (e.g. data/Falciparum, data/Malariae, ...), the tool expects:
- data/<Species>/img/: RGB microscopy images
- data/<Species>/gt/: corresponding binary masks (ground truth)
The tool matches files by exact filename (e.g. gt/XYZ.jpg corresponds to img/XYZ.jpg).
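The filename matching can be sketched as a pure path computation. This is a minimal illustration, assuming the data/<Species>/{img,gt} layout described above; the helper name image_path_for_mask is hypothetical and is not taken from mpidb_prep.rs, which may pair files differently.

```rust
use std::path::{Path, PathBuf};

/// Given a mask path such as `data/Falciparum/gt/XYZ.jpg`, build the path of
/// the image expected next to it: `data/Falciparum/img/XYZ.jpg`.
/// Returns `None` when the path is too short to hold a file name plus a
/// parent directory.
fn image_path_for_mask(mask_path: &Path) -> Option<PathBuf> {
    let file_name = mask_path.file_name()?; // XYZ.jpg (kept exactly as-is)
    let gt_dir = mask_path.parent()?; // data/<Species>/gt
    let species_dir = gt_dir.parent()?; // data/<Species>
    Some(species_dir.join("img").join(file_name))
}
```

Because the file name is reused unchanged, a mask saved with a different extension or casing than its image would not be paired under this scheme.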
The tool also reads:
data/Uninfected/
These images are used as negative examples (uninfected / no malaria).
The MP-IDB infected dataset provides segmentation masks that localize parasites. Using these masks lets us generate parasite-centered crops. This increases the signal-to-noise ratio during training compared to training on full images.
MP-IDB stage labels are image-level (inferred from filename tokens like R/T/S/G).
That means stage labels are weak with respect to individual parasite crops.
To make this usable, we train stage prediction as presence probability per crop (multi-label), acknowledging it’s weak supervision.
For each infected gt mask image:
- Load image + mask
  - Image is read as RGB (RgbImage)
  - Mask is read as grayscale (GrayImage)
- Connected components
  - We scan the mask and run a BFS connected-components search using 4-neighborhood (left/right/up/down).
  - Each connected component is assumed to correspond to one parasite region.
- Filter tiny components
  - Components with area < min_mask_area are discarded.
  - This removes small artifacts/noise in masks.
- Bounding box extraction
  - For each component we compute its bounding box (min_x, min_y, max_x, max_y).
- Context padding
  - We expand the bounding box by a fraction of its size.
  - Current padding fraction is fixed in code: pad_frac = 0.25 (25% padding)
- Square padding
  - The padded crop rectangle is converted to a square by taking: side = max(crop_w, crop_h)
  - The original rectangular crop is centered into the square canvas.
- Resize to training size
  - Final crop is resized to crop_size x crop_size (default 128x128) using FilterType::Triangle.
- Save crop
  - Output path: mpidb_crops/<Species>/<source_image_id>_<component_index>.png
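The component search and crop geometry in the steps above can be sketched without the image crate. The code below is an illustration under stated assumptions, not the code in mpidb_prep.rs: the mask is a row-major slice of booleans (true meaning a non-zero mask pixel), padding is applied on every side and clamped to the image bounds, and the names Component, find_components, and padded_crop are hypothetical.

```rust
use std::collections::VecDeque;

/// Inclusive pixel bounding box plus pixel count of one connected component.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Component {
    min_x: usize,
    min_y: usize,
    max_x: usize,
    max_y: usize,
    area: usize,
}

/// Find foreground regions of a row-major binary mask with a BFS over the
/// 4-neighborhood, dropping components with fewer than `min_area` pixels.
fn find_components(mask: &[bool], width: usize, height: usize, min_area: usize) -> Vec<Component> {
    assert_eq!(mask.len(), width * height);
    let mut visited = vec![false; mask.len()];
    let mut found = Vec::new();
    for start in 0..mask.len() {
        if !mask[start] || visited[start] {
            continue;
        }
        visited[start] = true;
        let mut queue = VecDeque::new();
        queue.push_back(start);
        let (sx, sy) = (start % width, start / width);
        let mut c = Component { min_x: sx, min_y: sy, max_x: sx, max_y: sy, area: 0 };
        while let Some(idx) = queue.pop_front() {
            let (x, y) = (idx % width, idx / width);
            c.area += 1;
            c.min_x = c.min_x.min(x);
            c.max_x = c.max_x.max(x);
            c.min_y = c.min_y.min(y);
            c.max_y = c.max_y.max(y);
            // Left, right, up, down; `None` marks a neighbor outside the mask.
            let neighbors = [
                if x > 0 { Some(idx - 1) } else { None },
                if x + 1 < width { Some(idx + 1) } else { None },
                if y > 0 { Some(idx - width) } else { None },
                if y + 1 < height { Some(idx + width) } else { None },
            ];
            for &n in neighbors.iter().flatten() {
                if mask[n] && !visited[n] {
                    visited[n] = true;
                    queue.push_back(n);
                }
            }
        }
        if c.area >= min_area {
            found.push(c);
        }
    }
    found
}

/// Expand a box by `pad_frac` of its width/height on each side (clamped to
/// the image) and return `(x, y, w, h, side)`, where `side = max(w, h)` is
/// the edge length of the square canvas the crop is centered into.
fn padded_crop(c: &Component, img_w: usize, img_h: usize, pad_frac: f32) -> (usize, usize, usize, usize, usize) {
    let box_w = c.max_x - c.min_x + 1;
    let box_h = c.max_y - c.min_y + 1;
    let pad_x = (box_w as f32 * pad_frac).round() as usize;
    let pad_y = (box_h as f32 * pad_frac).round() as usize;
    let x0 = c.min_x.saturating_sub(pad_x);
    let y0 = c.min_y.saturating_sub(pad_y);
    let x1 = (c.max_x + pad_x).min(img_w - 1);
    let y1 = (c.max_y + pad_y).min(img_h - 1);
    let (w, h) = (x1 - x0 + 1, y1 - y0 + 1);
    (x0, y0, w, h, w.max(h))
}
```

With a 4-neighborhood, two blobs that touch only diagonally are reported as separate components, so a thin diagonal bridge in a mask can split one parasite into several crops.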
Uninfected images do not have masks, so we generate a single crop per image:
- Load image as RGB
- Center square crop
  - Take the largest centered square from the image (uses the smaller of width/height).
- Resize to crop_size (default 128x128)
- Save crop
  - Output path: mpidb_crops/Uninfected/<source_image_id>_0.png
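The center square crop reduces to a small piece of arithmetic. A minimal sketch follows; the function name center_square is hypothetical, and how the tool rounds an odd leftover margin is an assumption (here the extra pixel goes to the right or bottom).

```rust
/// Return `(x, y, side)` of the largest centered square inside a
/// `width x height` image; `side` is the smaller of the two dimensions.
fn center_square(width: u32, height: u32) -> (u32, u32, u32) {
    let side = width.min(height);
    ((width - side) / 2, (height - side) / 2, side)
}
```

For a 640x480 frame this gives an offset of (80, 0) and a side of 480. With the image crate, the resulting rectangle could then be passed to imageops::crop_imm and the result to imageops::resize.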
Stages are inferred from tokens in the source_image_id (filename stem).
A stage flag is set to 1 if the token exists:
- R => ring
- T => trophozoite
- S => schizont
- G => gametocyte
Tokenization splits on -, _, and spaces.
Important: this is not a per-parasite stage ground truth. It is treated as multi-label “presence” supervision.
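The token rule can be sketched as follows. This is an illustration only: the function name stage_flags is hypothetical, and matching tokens case-sensitively as whole tokens is an assumption about the tool's behavior.

```rust
/// Stage presence flags in the order ring, trophozoite, schizont, gametocyte.
/// The stem is split on '-', '_', and spaces; a flag is set when the
/// corresponding single-letter token appears.
fn stage_flags(source_image_id: &str) -> [u8; 4] {
    let mut flags = [0u8; 4];
    for token in source_image_id.split(|c: char| c == '-' || c == '_' || c == ' ') {
        match token {
            "R" => flags[0] = 1,
            "T" => flags[1] = 1,
            "S" => flags[2] = 1,
            "G" => flags[3] = 1,
            _ => {} // any other token carries no stage information
        }
    }
    flags
}
```

For a made-up stem such as sample-0012-R_T, the flags would be ring and trophozoite. Every crop cut from that image inherits both flags, regardless of which stage the individual parasite is in.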
The tool writes mpidb_crops/manifest.csv with columns:
- crop_path: absolute/relative path string written by the tool
- infected: 1 for infected crops, 0 for uninfected crops
- species: one of Falciparum|Malariae|Ovale|Vivax|Uninfected
- stage_r: 0/1
- stage_t: 0/1
- stage_s: 0/1
- stage_g: 0/1
- source_image_id: the stem of the original image filename (used for splitting)
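To make the schema concrete, here is a sketch of one manifest row and its serialization in the column order listed above. The type ManifestRow, the constant MANIFEST_HEADER, and the lack of CSV quoting are assumptions for illustration; the actual tool may use a CSV library and a different in-memory representation.

```rust
/// One manifest row; fields follow the documented column order.
struct ManifestRow {
    crop_path: String,
    infected: u8,
    species: String,
    stages: [u8; 4], // stage_r, stage_t, stage_s, stage_g
    source_image_id: String,
}

const MANIFEST_HEADER: &str =
    "crop_path,infected,species,stage_r,stage_t,stage_s,stage_g,source_image_id";

impl ManifestRow {
    /// Serialize without quoting; assumes no field contains a comma or newline.
    fn to_csv_line(&self) -> String {
        let [r, t, s, g] = self.stages;
        format!(
            "{},{},{},{},{},{},{},{}",
            self.crop_path, self.infected, self.species, r, t, s, g, self.source_image_id
        )
    }
}
```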
When training, the dataset is split using source_image_id so that:
- All crops derived from the same original image stay in the same split
This prevents leakage where near-identical parasite crops from one image appear in both train and validation.
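One way to get a grouped split is to derive the split from a hash of source_image_id, so that every crop from one image lands on the same side. This is only a sketch of the idea: how src/training.rs actually assigns splits is not shown here, and the names fnv1a64 and is_validation as well as the percentage parameter are hypothetical.

```rust
/// 64-bit FNV-1a hash; used here because its output is fixed by the
/// algorithm and does not change between Rust releases.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Decide the split from the source image id alone, so all crops that share
/// an id get the same answer. `val_percent` is the target share (0..=100) of
/// source images, not of crops, sent to validation.
fn is_validation(source_image_id: &str, val_percent: u64) -> bool {
    fnv1a64(source_image_id.as_bytes()) % 100 < val_percent
}
```

Since the decision depends only on the id, adding or removing crops never moves an existing image between splits. The realized validation share is approximate and is measured in images, so images with many parasites can skew the crop-level ratio.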
- If you get zero crops for a species, check:
  - data/<Species>/gt exists and contains masks
  - data/<Species>/img contains the same filenames
  - masks are not empty and have non-zero pixels
- If you see too many tiny crops, increase min_mask_area.
- If crops cut off parasite context, increase the padding fraction in crop_and_square_pad (currently 0.25).
- Crop tool: src/bin/mpidb_prep.rs
- Manifest-based dataset loader: src/data.rs (MpIdbDataset)
- Training entry point: src/training.rs