Developer Guide

Crop Generation Strategy (MP-IDB + Uninfected)

This project trains on image crops rather than full microscopy frames. The crops are generated by the mpidb_prep utility:

cargo run --bin mpidb_prep -- <data_root> <out_root> <crop_size> <min_mask_area>

Defaults:

  • data_root: data
  • out_root: mpidb_crops
  • crop_size: 128
  • min_mask_area: 25
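
The positional arguments and their defaults could be handled roughly as follows. This is a minimal sketch, not the code in src/bin/mpidb_prep.rs: the `PrepConfig` struct and `parse_args` function are hypothetical names, and it assumes that any omitted trailing argument falls back to the default listed above.

```rust
use std::path::PathBuf;

/// Hypothetical config struct; field names mirror the documented positional arguments.
#[derive(Debug, PartialEq)]
struct PrepConfig {
    data_root: PathBuf,
    out_root: PathBuf,
    crop_size: u32,
    min_mask_area: usize,
}

/// Parse positional arguments (program name already stripped).
/// Assumption: a missing argument falls back to the documented default.
fn parse_args(args: &[String]) -> Result<PrepConfig, String> {
    let data_root = PathBuf::from(args.first().map(String::as_str).unwrap_or("data"));
    let out_root = PathBuf::from(args.get(1).map(String::as_str).unwrap_or("mpidb_crops"));
    let crop_size = match args.get(2) {
        Some(s) => s
            .parse::<u32>()
            .map_err(|e| format!("invalid crop_size '{}': {}", s, e))?,
        None => 128,
    };
    let min_mask_area = match args.get(3) {
        Some(s) => s
            .parse::<usize>()
            .map_err(|e| format!("invalid min_mask_area '{}': {}", s, e))?,
        None => 25,
    };
    Ok(PrepConfig { data_root, out_root, crop_size, min_mask_area })
}
```

Under that assumption, calling `parse_args` with an empty slice yields the defaults, and passing only the first three arguments overrides crop_size while keeping min_mask_area at 25.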

The output is:

  • A directory of saved crop images (.png)
  • A manifest.csv describing labels for each crop

Data Sources

Infected (MP-IDB)

For each malaria species directory under data/ (e.g. data/Falciparum, data/Malariae, ...), the tool expects:

  • data/<Species>/img/ : RGB microscopy images
  • data/<Species>/gt/ : corresponding binary masks (ground truth)

The tool matches files by exact filename (e.g. gt/XYZ.jpg corresponds to img/XYZ.jpg).
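
As an illustration of the filename matching, the image path can be derived from a mask path by swapping the gt directory for img and keeping the filename unchanged. The helper name `image_path_for_mask` is hypothetical, and the sketch assumes the directory layout described above.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Given `<data_root>/<Species>/gt/<name>`, return `<data_root>/<Species>/img/<name>`.
/// Returns None if the mask is not directly inside a directory named `gt`.
fn image_path_for_mask(mask_path: &Path) -> Option<PathBuf> {
    let file_name = mask_path.file_name()?;
    let gt_dir = mask_path.parent()?;
    if gt_dir.file_name()? != OsStr::new("gt") {
        return None;
    }
    let species_dir = gt_dir.parent()?;
    // Exact filename match: same name and extension under img/.
    Some(species_dir.join("img").join(file_name))
}
```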

Uninfected (negative samples)

The tool also reads:

  • data/Uninfected/

These images are used as negative examples (uninfected / no malaria).

Key Design Choice

Why crops?

The MP-IDB infected dataset provides segmentation masks that localize parasites. Using these masks lets us generate parasite-centered crops, so each training example is mostly parasite and nearby context rather than background. This gives a higher signal-to-noise ratio than training on full images.

Weak stage labels

MP-IDB stage labels are image-level: they are inferred from filename tokens such as R, T, S, and G. A crop therefore inherits every stage listed for its source image, whether or not that stage is visible in the crop itself. To make this usable, we train stage prediction as a multi-label problem in which each output is the probability that a stage is present, and we accept that this is weak supervision.

Infected Crop Algorithm (from masks)

For each infected gt mask image:

  1. Load image + mask

    • Image is read as RGB (RgbImage)
    • Mask is read as grayscale (GrayImage)
  2. Connected components

    • We scan the mask and run a BFS connected-components search using 4-neighborhood (left/right/up/down).
    • Each connected component is assumed to correspond to one parasite region.
  3. Filter tiny components

    • Components with area < min_mask_area are discarded.
    • This removes small artifacts/noise in masks.
  4. Bounding box extraction

    • For each component we compute its bounding box (min_x, min_y, max_x, max_y).
  5. Context padding

    • We expand the bounding box by a fraction of its size.
    • Current padding fraction is fixed in code:
      • pad_frac = 0.25 (25% padding)
  6. Square padding

    • The padded crop rectangle is converted to a square by taking:
      • side = max(crop_w, crop_h)
    • The original rectangular crop is centered into the square canvas.
  7. Resize to training size

    • The final crop is resized to crop_size x crop_size (default 128x128) using FilterType::Triangle (linear interpolation).
  8. Save crop

    • Output path (shown with the default out_root):
      • mpidb_crops/<Species>/<source_image_id>_<component_index>.png
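
Steps 2 to 4 can be sketched on a raw row-major mask buffer, without the image crate. This is an illustrative version rather than the code in mpidb_prep.rs: the `Component` struct and `find_components` function are hypothetical names, and treating any non-zero pixel as foreground is an assumption about how the mask is thresholded.

```rust
use std::collections::VecDeque;

/// Inclusive bounding box and pixel count of one connected component.
#[derive(Debug, Clone, PartialEq)]
struct Component {
    min_x: usize,
    min_y: usize,
    max_x: usize,
    max_y: usize,
    area: usize,
}

/// BFS connected components with a 4-neighborhood over a row-major mask.
/// Assumption: a pixel is foreground when its value is non-zero.
/// Components with area < min_mask_area are dropped.
fn find_components(
    mask: &[u8],
    width: usize,
    height: usize,
    min_mask_area: usize,
) -> Vec<Component> {
    assert_eq!(mask.len(), width * height);
    let mut visited = vec![false; mask.len()];
    let mut queue = VecDeque::new();
    let mut out = Vec::new();

    for start in 0..mask.len() {
        if mask[start] == 0 || visited[start] {
            continue;
        }
        visited[start] = true;
        queue.push_back(start);
        let (sx, sy) = (start % width, start / width);
        let mut c = Component { min_x: sx, min_y: sy, max_x: sx, max_y: sy, area: 0 };

        while let Some(idx) = queue.pop_front() {
            let (x, y) = (idx % width, idx / width);
            c.area += 1;
            c.min_x = c.min_x.min(x);
            c.max_x = c.max_x.max(x);
            c.min_y = c.min_y.min(y);
            c.max_y = c.max_y.max(y);

            // 4-neighborhood: left, right, up, down (no diagonals).
            let mut neighbors = [None; 4];
            if x > 0 { neighbors[0] = Some(idx - 1); }
            if x + 1 < width { neighbors[1] = Some(idx + 1); }
            if y > 0 { neighbors[2] = Some(idx - width); }
            if y + 1 < height { neighbors[3] = Some(idx + width); }

            for n in neighbors.iter().flatten() {
                let n = *n;
                if mask[n] != 0 && !visited[n] {
                    visited[n] = true;
                    queue.push_back(n);
                }
            }
        }

        // Step 3: discard tiny components.
        if c.area >= min_mask_area {
            out.push(c);
        }
    }
    out
}
```

Because only the four axis-aligned neighbors are visited, two regions that touch only at a corner are reported as separate components.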

Uninfected Crop Algorithm

Uninfected images do not have masks, so we generate a single crop per image:

  1. Load image as RGB
  2. Center square crop
    • Take the largest centered square from the image (uses the smaller of width/height).
  3. Resize to crop_size (default 128x128)
  4. Save crop
    • Output path (shown with the default out_root):
      • mpidb_crops/Uninfected/<source_image_id>_0.png
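
The center-square step reduces to a small piece of arithmetic. The sketch below works on a raw row-major RGB buffer (3 bytes per pixel) so that it stays self-contained. The function names are hypothetical, and the real tool presumably operates on an RgbImage instead.

```rust
/// Top-left corner (x, y) and side length of the largest centered square
/// inside a `width` x `height` image.
fn center_square(width: u32, height: u32) -> (u32, u32, u32) {
    let side = width.min(height);
    ((width - side) / 2, (height - side) / 2, side)
}

/// Copy the centered square out of a row-major RGB buffer (3 bytes per pixel).
/// Returns the cropped bytes and the side length of the square.
fn crop_center_square(rgb: &[u8], width: u32, height: u32) -> (Vec<u8>, u32) {
    assert_eq!(rgb.len(), (width * height * 3) as usize);
    let (x0, y0, side) = center_square(width, height);
    let mut out = Vec::with_capacity((side * side * 3) as usize);
    for y in y0..y0 + side {
        // Byte offset of the first pixel of this row inside the square.
        let start = ((y * width + x0) * 3) as usize;
        out.extend_from_slice(&rgb[start..start + (side * 3) as usize]);
    }
    (out, side)
}
```

For a 640x480 frame, `center_square` gives a 480-pixel square starting at x = 80, y = 0. The resize to crop_size is a separate step and is not shown here.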

Stage Label Inference

Stages are inferred from tokens in the source_image_id (filename stem). A stage flag is set to 1 if the corresponding token is present and 0 otherwise:

  • R => ring
  • T => trophozoite
  • S => schizont
  • G => gametocyte

Tokenization splits on -, _, and spaces.

Important: this is not a per-parasite stage ground truth. It is treated as multi-label “presence” supervision.
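
The token rule above could be written as follows. This is a sketch under two assumptions that the guide does not confirm: tokens are matched exactly and case-sensitively, and a combined token such as "RT" does not count. `StageFlags` and `infer_stages` are hypothetical names, and the example IDs in the comments are made up.

```rust
/// Multi-label stage flags for one source image (1 = token present).
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct StageFlags {
    r: u8, // ring
    t: u8, // trophozoite
    s: u8, // schizont
    g: u8, // gametocyte
}

/// Split the filename stem on '-', '_' and ' ' and set a flag for each
/// stage token found. Assumption: exact, case-sensitive token match.
fn infer_stages(source_image_id: &str) -> StageFlags {
    let mut flags = StageFlags::default();
    for token in source_image_id.split(|c: char| c == '-' || c == '_' || c == ' ') {
        match token {
            "R" => flags.r = 1,
            "T" => flags.t = 1,
            "S" => flags.s = 1,
            "G" => flags.g = 1,
            _ => {} // numeric IDs, empty tokens, anything else
        }
    }
    flags
}
```

With this rule a made-up stem like "0001-R_T" sets the ring and trophozoite flags, while "RING_0001" sets none because "RING" is not an exact token match.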

Manifest Schema

The tool writes manifest.csv into out_root (mpidb_crops/manifest.csv with the default settings). It has these columns:

  • crop_path : path to the saved crop, exactly as the tool wrote it (may be absolute or relative)
  • infected : 1 for infected crops, 0 for uninfected crops
  • species : one of Falciparum|Malariae|Ovale|Vivax|Uninfected
  • stage_r : 0/1
  • stage_t : 0/1
  • stage_s : 0/1
  • stage_g : 0/1
  • source_image_id : the stem of the original image filename (used for splitting)
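
One way to model a manifest row in code is sketched below. It assumes the columns appear in the order listed, that the file starts with a header line, and that no field contains a comma or quote. None of these are confirmed by this guide. `ManifestRow` and its methods are hypothetical names and are not the MpIdbDataset loader.

```rust
/// One manifest row; fields follow the documented column list.
#[derive(Debug, Clone, PartialEq)]
struct ManifestRow {
    crop_path: String,
    infected: u8,
    species: String,
    stage_r: u8,
    stage_t: u8,
    stage_s: u8,
    stage_g: u8,
    source_image_id: String,
}

/// Assumed header line (column order as listed in the schema).
const MANIFEST_HEADER: &str =
    "crop_path,infected,species,stage_r,stage_t,stage_s,stage_g,source_image_id";

impl ManifestRow {
    /// Serialize as one CSV line. Assumption: fields never need quoting.
    fn to_csv_line(&self) -> String {
        format!(
            "{},{},{},{},{},{},{},{}",
            self.crop_path, self.infected, self.species,
            self.stage_r, self.stage_t, self.stage_s, self.stage_g,
            self.source_image_id
        )
    }

    /// Parse one CSV data line; returns None on a wrong field count or bad number.
    fn from_csv_line(line: &str) -> Option<ManifestRow> {
        let f: Vec<&str> = line.trim_end().split(',').collect();
        if f.len() != 8 {
            return None;
        }
        Some(ManifestRow {
            crop_path: f[0].to_string(),
            infected: f[1].parse().ok()?,
            species: f[2].to_string(),
            stage_r: f[3].parse().ok()?,
            stage_t: f[4].parse().ok()?,
            stage_s: f[5].parse().ok()?,
            stage_g: f[6].parse().ok()?,
            source_image_id: f[7].to_string(),
        })
    }
}
```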

Leakage-Safe Splitting

When training, the dataset is split using source_image_id so that:

  • All crops derived from the same original image stay in the same split

This prevents leakage where near-identical parasite crops from one image appear in both train and validation.
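
This guide does not say how the split itself is computed. One common way to get a leakage-safe split is to hash source_image_id and assign the whole group by hash bucket, as sketched below. The FNV-1a hash, the `Split` enum, and the `val_percent` parameter are choices made for this illustration and are not taken from src/training.rs.

```rust
/// Which partition a source image (and all of its crops) belongs to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Split {
    Train,
    Val,
}

/// 64-bit FNV-1a hash. Chosen here because it is simple and gives the
/// same result on every run and platform.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Assign a split from the source image ID alone, so every crop that
/// shares the ID lands in the same partition.
/// `val_percent` is the approximate share of images sent to validation (0..=100).
fn assign_split(source_image_id: &str, val_percent: u64) -> Split {
    if fnv1a(source_image_id) % 100 < val_percent {
        Split::Val
    } else {
        Split::Train
    }
}
```

Because the decision depends only on source_image_id, crops such as `<id>_0.png` and `<id>_1.png` can never be separated, and the assignment stays stable when new images are added. The split ratio is only approximate, especially for small datasets.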

Practical Notes / Debugging

  • If you get zero crops for a species, check:

    • data/<Species>/gt exists and contains masks
    • data/<Species>/img contains the same filenames
    • masks are not empty and have non-zero pixels
  • If you see too many tiny crops, increase min_mask_area.

  • If crops cut off parasite context, increase the padding fraction in crop_and_square_pad (currently 0.25).
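
To see how the padding fraction changes the crop, the geometry of the padding and square steps can be sketched as below. This is not the crop_and_square_pad function itself. It assumes the padding is applied on each side as a fraction of the box width and height, and that the padded box is clamped to the image bounds. The guide confirms neither detail. `Rect`, `padded_rect`, and `square_layout` are hypothetical names.

```rust
/// Axis-aligned rectangle in pixels; (x, y) is the top-left corner.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: u32,
    y: u32,
    w: u32,
    h: u32,
}

/// Expand an inclusive bounding box by `pad_frac` of its size on each side
/// (assumption), clamped to the image bounds (assumption).
fn padded_rect(
    min_x: u32, min_y: u32, max_x: u32, max_y: u32,
    pad_frac: f32, img_w: u32, img_h: u32,
) -> Rect {
    assert!(img_w > 0 && img_h > 0 && max_x >= min_x && max_y >= min_y);
    let w = max_x - min_x + 1;
    let h = max_y - min_y + 1;
    let pad_x = (w as f32 * pad_frac).round() as u32;
    let pad_y = (h as f32 * pad_frac).round() as u32;
    let x0 = min_x.saturating_sub(pad_x);
    let y0 = min_y.saturating_sub(pad_y);
    let x1 = (max_x + pad_x).min(img_w - 1);
    let y1 = (max_y + pad_y).min(img_h - 1);
    Rect { x: x0, y: y0, w: x1 - x0 + 1, h: y1 - y0 + 1 }
}

/// Square canvas side plus the (x, y) offset at which the rectangular crop
/// is pasted so that it sits centered: side = max(crop_w, crop_h).
fn square_layout(crop_w: u32, crop_h: u32) -> (u32, u32, u32) {
    let side = crop_w.max(crop_h);
    (side, (side - crop_w) / 2, (side - crop_h) / 2)
}
```

Under these assumptions, a 40x20 box with pad_frac = 0.25 becomes a 60x30 crop on a 60x60 canvas. Raising pad_frac to 0.5 gives an 80x40 crop, so the parasite occupies a smaller share of the final resized image.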

Where the Code Lives

  • Crop tool: src/bin/mpidb_prep.rs
  • Manifest-based dataset loader: src/data.rs (MpIdbDataset)
  • Training entry point: src/training.rs