Developer Guide

Crop Generation Strategy (MP-IDB + Uninfected)

This project trains on image crops rather than full microscopy frames. The crops are generated by the mpidb_prep utility:

cargo run --bin mpidb_prep -- <data_root> <out_root> <crop_size> <min_mask_area>

Defaults:

data_root: data
out_root: mpidb_crops
crop_size: 128
min_mask_area: 25
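
As an illustration of how these defaults could be applied, here is a minimal sketch of positional argument parsing. It is not the tool's actual code: `PrepConfig` and `parse_args` are hypothetical names, and it assumes each argument may be omitted from the right, in which case the documented default is used.

```rust
use std::path::PathBuf;

/// Hypothetical config struct; fields mirror the documented CLI arguments.
#[derive(Debug, PartialEq)]
struct PrepConfig {
    data_root: PathBuf,
    out_root: PathBuf,
    crop_size: u32,
    min_mask_area: usize,
}

/// Parse positional arguments (program name already removed),
/// falling back to the documented defaults for missing ones.
fn parse_args(args: &[String]) -> Result<PrepConfig, String> {
    let data_root = PathBuf::from(args.get(0).map(String::as_str).unwrap_or("data"));
    let out_root = PathBuf::from(args.get(1).map(String::as_str).unwrap_or("mpidb_crops"));
    let crop_size = match args.get(2) {
        Some(s) => s
            .parse::<u32>()
            .map_err(|e| format!("invalid crop_size {:?}: {}", s, e))?,
        None => 128,
    };
    let min_mask_area = match args.get(3) {
        Some(s) => s
            .parse::<usize>()
            .map_err(|e| format!("invalid min_mask_area {:?}: {}", s, e))?,
        None => 25,
    };
    Ok(PrepConfig { data_root, out_root, crop_size, min_mask_area })
}
```

In a binary, this would typically be called with `std::env::args().skip(1).collect::<Vec<_>>()`.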

The output is:

A directory of saved crop images (.png)
A manifest.csv describing labels for each crop

Data Sources

Infected (MP-IDB)

For each malaria species directory under data/ (e.g. data/Falciparum, data/Malariae, ...), the tool expects:

data/<Species>/img/ : RGB microscopy images
data/<Species>/gt/ : corresponding binary masks (ground truth)

The tool matches files by exact filename (e.g. gt/XYZ.jpg corresponds to img/XYZ.jpg).
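
The filename matching can be illustrated with a small path helper. This is a sketch rather than the tool's code: `image_path_for_mask` is a hypothetical name, and it assumes the mask and the image share the full filename, extension included, as in the example above.

```rust
use std::path::{Path, PathBuf};

/// Given a species directory and a mask path under `gt/`, build the
/// path where the matching image is expected under `img/`.
/// Returns None if the mask path has no final filename component.
fn image_path_for_mask(species_dir: &Path, mask_path: &Path) -> Option<PathBuf> {
    let name = mask_path.file_name()?;
    Some(species_dir.join("img").join(name))
}
```

A caller would still need to check that the returned path exists before loading it.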

Uninfected (negative samples)

The tool also reads:

data/Uninfected/

These images are used as negative examples (uninfected / no malaria).

Key Design Choice

Why crops?

The MP-IDB infected dataset provides segmentation masks that localize parasites. Using these masks lets us generate parasite-centered crops, so each training example is dominated by the parasite and its immediate surroundings rather than by background. This gives a higher signal-to-noise ratio than training on full images.

Weak stage labels

MP-IDB stage labels are image-level: they are inferred from filename tokens such as R/T/S/G. Every crop from an image therefore inherits that image's stage flags, whether or not the individual parasite in the crop is at that stage, which makes the labels weak at crop level. To make this usable, we train stage prediction as a per-crop, multi-label presence probability and treat it as weak supervision.

Infected Crop Algorithm (from masks)

For each infected gt mask image:

Load image + mask
- Image is read as RGB (RgbImage)
- Mask is read as grayscale (GrayImage)
Connected components
- We scan the mask and run a BFS connected-components search using 4-neighborhood (left/right/up/down).
- Each connected component is assumed to correspond to one parasite region.
Filter tiny components
- Components with area < min_mask_area are discarded.
- This removes small artifacts/noise in masks.
Bounding box extraction
- For each component we compute its bounding box (min_x, min_y, max_x, max_y).
Context padding
- We expand the bounding box by a fraction of its size.
- Current padding fraction is fixed in code:
  - pad_frac = 0.25 (25% padding)
Square padding
- The padded crop rectangle is converted to a square by taking:
  - side = max(crop_w, crop_h)
- The original rectangular crop is centered on the square canvas.
Resize to training size
- Final crop is resized to crop_size x crop_size (default 128x128) using FilterType::Triangle.
Save crop
- Output path:
  - mpidb_crops/<Species>/<source_image_id>_<component_index>.png
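
The geometric core of the steps above (connected components, area filter, bounding box, padding, square side) can be sketched without any image library. This is an illustrative version, not the code in mpidb_prep.rs: `Component`, `find_components`, and `padded_square` are hypothetical names, the mask is a row-major byte slice, any non-zero pixel is treated as foreground, and padding is applied per axis, rounded, and clamped to the image bounds. Those last details are assumptions, not confirmed behavior.

```rust
use std::collections::VecDeque;

/// Inclusive bounding box plus pixel count of one connected component.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Component {
    min_x: u32,
    min_y: u32,
    max_x: u32,
    max_y: u32,
    area: usize,
}

/// BFS over 4-connected foreground pixels; components with
/// area < min_mask_area are dropped.
fn find_components(mask: &[u8], width: u32, height: u32, min_mask_area: usize) -> Vec<Component> {
    assert_eq!(mask.len(), (width as usize) * (height as usize));
    let idx = |x: u32, y: u32| (y as usize) * (width as usize) + (x as usize);
    let mut visited = vec![false; mask.len()];
    let mut out = Vec::new();
    let mut queue = VecDeque::new();

    for y in 0..height {
        for x in 0..width {
            if mask[idx(x, y)] == 0 || visited[idx(x, y)] {
                continue;
            }
            visited[idx(x, y)] = true;
            queue.push_back((x, y));
            let mut c = Component { min_x: x, min_y: y, max_x: x, max_y: y, area: 0 };
            while let Some((cx, cy)) = queue.pop_front() {
                c.area += 1;
                c.min_x = c.min_x.min(cx);
                c.min_y = c.min_y.min(cy);
                c.max_x = c.max_x.max(cx);
                c.max_y = c.max_y.max(cy);
                // 4-neighborhood: left, right, up, down (None = out of bounds).
                let neighbors = [
                    (cx.checked_sub(1), Some(cy)),
                    (cx.checked_add(1).filter(|&v| v < width), Some(cy)),
                    (Some(cx), cy.checked_sub(1)),
                    (Some(cx), cy.checked_add(1).filter(|&v| v < height)),
                ];
                for n in neighbors {
                    if let (Some(nx), Some(ny)) = n {
                        let i = idx(nx, ny);
                        if mask[i] != 0 && !visited[i] {
                            visited[i] = true;
                            queue.push_back((nx, ny));
                        }
                    }
                }
            }
            if c.area >= min_mask_area {
                out.push(c);
            }
        }
    }
    out
}

/// Returns (x0, y0, crop_w, crop_h, side): the padded crop rectangle,
/// clamped to the image, and the side of the square canvas.
fn padded_square(c: &Component, width: u32, height: u32, pad_frac: f32) -> (u32, u32, u32, u32, u32) {
    let bw = c.max_x - c.min_x + 1;
    let bh = c.max_y - c.min_y + 1;
    let pad_x = (bw as f32 * pad_frac).round() as u32;
    let pad_y = (bh as f32 * pad_frac).round() as u32;
    let x0 = c.min_x.saturating_sub(pad_x);
    let y0 = c.min_y.saturating_sub(pad_y);
    let x1 = (c.max_x + pad_x).min(width - 1);
    let y1 = (c.max_y + pad_y).min(height - 1);
    let (crop_w, crop_h) = (x1 - x0 + 1, y1 - y0 + 1);
    (x0, y0, crop_w, crop_h, crop_w.max(crop_h))
}
```

For a 20x8 bounding box with pad_frac = 0.25, this yields a 30x12 padded rectangle and a 30x30 square canvas. Because only 4-neighbors are followed, two blobs that touch only diagonally are counted as separate components.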

Uninfected Crop Algorithm

Uninfected images do not have masks, so we generate a single crop per image:

Load image as RGB
Center square crop
- Take the largest centered square from the image (uses the smaller of width/height).
Resize to crop_size (default 128x128)
Save crop
- Output path:
  - mpidb_crops/Uninfected/<source_image_id>_0.png
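
The center square crop reduces to a small piece of arithmetic. The helper below is a sketch with a hypothetical name (`center_square`); it only computes the crop rectangle and leaves image loading and resizing to the caller.

```rust
/// Returns (x, y, side) of the largest centered square inside a
/// width x height image. The side is the smaller of the two dimensions.
fn center_square(width: u32, height: u32) -> (u32, u32, u32) {
    let side = width.min(height);
    ((width - side) / 2, (height - side) / 2, side)
}
```

For a 640x480 image this gives a 480x480 square starting at x = 80, y = 0. With the image crate, a rectangle like this could be passed to `imageops::crop_imm` and the result to `imageops::resize`.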

Stage Label Inference

Stages are inferred from tokens in the source_image_id (filename stem). A stage flag is set to 1 if the corresponding token is present:

R => ring
T => trophozoite
S => schizont
G => gametocyte

Tokenization splits on -, _, and spaces.

Important: this is not a per-parasite stage ground truth. It is treated as multi-label “presence” supervision.
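
A minimal sketch of this inference follows. `StageFlags` and `infer_stages` are hypothetical names, and the sketch assumes a token must match a stage letter exactly and case-sensitively. Whether the real tool also accepts lowercase letters or combined tokens such as RT is not stated here.

```rust
/// Multi-label stage presence flags (0 or 1 each).
#[derive(Debug, Default, PartialEq)]
struct StageFlags {
    r: u8,
    t: u8,
    s: u8,
    g: u8,
}

/// Split the filename stem on '-', '_' and spaces, then set a flag
/// for every token that is exactly one of R, T, S, G.
fn infer_stages(source_image_id: &str) -> StageFlags {
    let mut flags = StageFlags::default();
    for token in source_image_id.split(|c: char| c == '-' || c == '_' || c == ' ') {
        match token {
            "R" => flags.r = 1,
            "T" => flags.t = 1,
            "S" => flags.s = 1,
            "G" => flags.g = 1,
            _ => {}
        }
    }
    flags
}
```

With this rule, a hypothetical stem such as `1305121398-0001-R_T` sets the ring and trophozoite flags, while a stem like `Sample_01` sets none, because `Sample` is not the single-letter token `S`.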

Manifest Schema

The tool writes mpidb_crops/manifest.csv with columns:

crop_path : path to the saved crop image, as written by the tool (may be absolute or relative)
infected : 1 for infected crops, 0 for uninfected crops
species : one of Falciparum|Malariae|Ovale|Vivax|Uninfected
stage_r : 0/1
stage_t : 0/1
stage_s : 0/1
stage_g : 0/1
source_image_id : the stem of the original image filename (used for splitting)
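
For orientation, one manifest row could be modeled as below. This is a sketch under assumptions: `ManifestRow` and `to_csv_line` are hypothetical names, the column order follows the list above, and fields are written without quoting, which only works if paths and IDs contain no commas.

```rust
/// One row of manifest.csv, with columns in the documented order.
struct ManifestRow {
    crop_path: String,
    infected: u8,
    species: String,
    stage_r: u8,
    stage_t: u8,
    stage_s: u8,
    stage_g: u8,
    source_image_id: String,
}

/// Header line matching the documented column names.
const MANIFEST_HEADER: &str =
    "crop_path,infected,species,stage_r,stage_t,stage_s,stage_g,source_image_id";

impl ManifestRow {
    /// Serialize as a single comma-separated line (no quoting or escaping).
    fn to_csv_line(&self) -> String {
        format!(
            "{},{},{},{},{},{},{},{}",
            self.crop_path,
            self.infected,
            self.species,
            self.stage_r,
            self.stage_t,
            self.stage_s,
            self.stage_g,
            self.source_image_id
        )
    }
}
```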

Leakage-Safe Splitting

When training, the dataset is split using source_image_id so that:

All crops derived from the same original image stay in the same split

This prevents leakage where near-identical parasite crops from one image appear in both train and validation.
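
One way to get this property is to decide the split from a deterministic hash of source_image_id, so every crop with the same ID lands on the same side. The sketch below illustrates that idea only. It is not necessarily how src/data.rs splits: `fnv1a`, `is_validation`, `split_by_source`, and the `val_percent` parameter are all hypothetical.

```rust
/// 64-bit FNV-1a hash: stable across runs and platforms.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// An image ID goes to validation if its hash falls in the lowest
/// `val_percent` of 100 buckets.
fn is_validation(source_image_id: &str, val_percent: u64) -> bool {
    fnv1a(source_image_id) % 100 < val_percent
}

/// Partition crop indices into (train, val) using only the source ID,
/// so crops from one original image never straddle the two splits.
fn split_by_source(source_ids: &[&str], val_percent: u64) -> (Vec<usize>, Vec<usize>) {
    let mut train = Vec::new();
    let mut val = Vec::new();
    for (i, id) in source_ids.iter().enumerate() {
        if is_validation(id, val_percent) {
            val.push(i);
        } else {
            train.push(i);
        }
    }
    (train, val)
}
```

Because the assignment depends only on the ID, the realized validation fraction is approximate. It approaches `val_percent` as the number of distinct source images grows.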

Practical Notes / Debugging

If you get zero crops for a species, check:
- data/<Species>/gt exists and contains masks
- data/<Species>/img contains the same filenames
- masks contain non-zero pixels
- min_mask_area is not larger than the components in the masks (smaller components are discarded)
If you see too many tiny crops, increase min_mask_area.
If crops cut off parasite context, increase the padding fraction in crop_and_square_pad (currently 0.25).

Where the Code Lives

Crop tool: src/bin/mpidb_prep.rs
Manifest-based dataset loader: src/data.rs (MpIdbDataset)
Training entry point: src/training.rs
