Conversation

@bw4sz bw4sz commented Dec 24, 2025

Description

This PR adds a script to organize and train a new bird detector. It uses data from the original Weinstein et al. 2022 paper and adds data from the Drones for Ducks project and other datasets from lila.science.

I added blank white images to test performance and can confirm the model no longer predicts boxes in blank images, with an empty-frame accuracy of 100%.
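
As a rough illustration (not the actual script in this PR), here is a minimal sketch of adding blank white frames to a DeepForest-style annotation CSV so empty-frame accuracy can be measured; the all-zero box convention for empty frames and the column names are assumptions.

```python
# Minimal sketch: generate blank white images and append them to a
# DeepForest-style annotation CSV as empty frames. The all-zero box
# convention and column names are assumptions, not this PR's actual code.
import os
import pandas as pd
from PIL import Image

def add_blank_frames(csv_file, root_dir, n_blank=50, size=(1500, 1500)):
    annotations = pd.read_csv(csv_file)
    blank_rows = []
    for i in range(n_blank):
        name = f"blank_{i}.png"
        Image.new("RGB", size, color=(255, 255, 255)).save(os.path.join(root_dir, name))
        # Represent an empty frame as a single row with a zero-area box
        blank_rows.append({"image_path": name, "xmin": 0, "ymin": 0,
                           "xmax": 0, "ymax": 0, "label": "Bird"})
    combined = pd.concat([annotations, pd.DataFrame(blank_rows)], ignore_index=True)
    combined.to_csv(csv_file, index=False)
    return combined
```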

Next steps

  • Update docs
  • Compare performance to old detector
  • Update weights on Hugging Face
  • Check tiling sensitivity and optionally add more zoom augmentations.
  • Reach out to the community for other images to test against. Add these to the docs.
  • Add a couple of non-bird images to test as well.
  • Quick comparison with Segment Anything 3, just a screenshot from the browser.
  • Check the precision/recall tradeoff at different score thresholds (see the sketch after this list).
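
For the score-threshold item above, a rough sketch of the kind of sweep I have in mind; the `score` and `true_positive` columns are hypothetical and would come from IoU matching of predictions against ground truth.

```python
# Sketch of a precision/recall sweep over score thresholds. Assumes a
# per-prediction table with hypothetical `score` and `true_positive`
# columns produced by IoU matching, plus the total ground-truth box count.
import numpy as np
import pandas as pd

def pr_by_threshold(matches, n_ground_truth, thresholds=np.arange(0.1, 0.95, 0.05)):
    rows = []
    for t in thresholds:
        kept = matches[matches["score"] >= t]
        tp = int(kept["true_positive"].sum())
        precision = tp / len(kept) if len(kept) else 1.0
        recall = tp / n_ground_truth if n_ground_truth else 0.0
        rows.append({"threshold": round(float(t), 2),
                     "precision": precision, "recall": recall})
    return pd.DataFrame(rows)
```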

Other issues

There is an issue that needs to be documented: model.evaluate() needs a size argument (see below), and more importantly it does not give the same results as the in-training validation loop. These may be related. Let's wait until #1238 is solved and then confirm. I saw performance drop completely.
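
For reference, a minimal sketch of the two evaluation paths being compared; the checkpoint and CSV paths are placeholders, and the exact evaluate() signature (including the size argument mentioned above) is an assumption.

```python
# Sketch of the two evaluation paths; paths and the `size` argument are
# assumptions, and the exact evaluate() signature may differ.
from deepforest import main

model = main.deepforest.load_from_checkpoint("bird_detector.ckpt")  # hypothetical checkpoint

# Path 1: the Lightning validation loop (matches metrics logged during training)
model.create_trainer()
validate_results = model.trainer.validate(model)

# Path 2: standalone evaluate(), which currently needs a size argument and
# does not reproduce the validation-loop numbers (possibly related to #1238)
evaluate_results = model.evaluate(
    csv_file="validation.csv",  # hypothetical
    root_dir="images/",         # hypothetical
    iou_threshold=0.4,
    size=1500,                  # assumed argument per the note above
)
```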

I am quite confused about the CPU memory usage (@jveitchmichaelis, did you see this in other model training?). It just doesn't jibe with my expectations and back-of-the-envelope calculations: with 6 workers, a prefetch factor of 2, a batch size of 20, and an average image size of 10 MB, that's 6 × 2 × 20 × 10 MB ≈ 2.4 GB. We are seeing HUGE memory usage, and it seems to come from within the model.train loop, not the dataloader. I am concerned about kornia.
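
The back-of-the-envelope estimate above, spelled out (just to show why ~2.4 GB was expected, not where the memory actually goes):

```python
# Expected dataloader buffer size from the numbers above.
num_workers = 6
prefetch_factor = 2
batch_size = 20
avg_image_mb = 10

expected_mb = num_workers * prefetch_factor * batch_size * avg_image_mb
print(f"Expected dataloader buffer: ~{expected_mb} MB (~{expected_mb / 1000:.1f} GB)")
# Expected dataloader buffer: ~2400 MB (~2.4 GB)
```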

[Screenshot: CPU memory usage during training, 2025-12-24 10:37 AM]

Related Issue(s)

I've opened a number of issues during this PR:

#1246 #1245 #1244

AI-Assisted Development

  • [x] I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • [x] I understand all the code I'm submitting
  • [x] I have reviewed and validated all AI-generated code

AI tools used (if applicable):

@bw4sz bw4sz self-assigned this Dec 24, 2025

jveitchmichaelis commented Dec 24, 2025

Yeah there are possibly some memory leaks. I've been trying to hunt this down with the DINO branch. You can try aggressively clearing the cache + running gc.collect() at the end of each epoch. It's hard to tell on HPG because you're shown the entire system RAM and not only your own process. Best to debug that locally if you can.
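
For reference, a minimal sketch of that suggestion as a Lightning callback (the callback name and wiring are mine, not existing DeepForest code):

```python
# Sketch: aggressively clear the CUDA cache and run the garbage collector
# at the end of each training epoch. The callback is illustrative only.
import gc
import torch
import pytorch_lightning as pl

class MemoryCleanupCallback(pl.Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

# e.g. trainer = pl.Trainer(callbacks=[MemoryCleanupCallback()])
```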

Another one was that the losses should be detached when logging, but not when being returned. We also need to make sure we don't return things from hooks that shouldn't be called directly, and that metrics are all reset. Let me gather up my changes and open a PR.
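
A sketch of the logging pattern I mean, assuming a LightningModule wrapping a torchvision-style detector that returns a dict of losses (names are illustrative):

```python
# Log a detached copy of the loss so the logger does not hold onto the
# autograd graph, but return the attached loss so backprop still works.
import torch
import pytorch_lightning as pl

class DetectorModule(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        images, targets = batch  # hypothetical batch structure
        loss_dict = self.model(images, targets)
        total_loss = sum(loss_dict.values())

        # Detach only for logging; keep the graph on the returned value
        self.log("train_loss", total_loss.detach(), on_epoch=True, prog_bar=True)
        return total_loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)
```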

[Plot: CPU memory trace over a long training run]

Here is my trace for a long run after I made an effort to stop this from happening. You need to ignore the "background" level from other resident processes on the cluster (i.e., you can constrain with mem in SLURM and then see how low you can take it; I normally allocate 64 GB/GPU for batch sizes of 16-32). You should see it tick up (caching?) and then flatten out.

I'm pretty sure it has nothing to do with validation, as the plots look the same, just without the small dip at the end of each training epoch.


bw4sz commented Dec 30, 2025

Comparison to the old detector shows much improved performance on a 90/10 split.


Box Precision:
  Checkpoint:  0.8492
  Pretrained:  0.7495
  Difference:  +0.0997 (+13.30%)

Box Recall:
  Checkpoint:  0.8662
  Pretrained:  0.4645
  Difference:  +0.4017 (+86.47%)

Empty Frame Accuracy:
  Checkpoint:  1.0000
  Pretrained:  0.0000
  Difference:  +1.0000

This is from trainer.validate(model); the results from main.evaluate() feel muddled by #1238. We need to fully understand that issue before merging this PR.
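
For context, a rough sketch of how such a comparison can be run; the checkpoint path and the Hugging Face model id are placeholders, and both models are assumed to have the same validation CSV set in their config.

```python
# Sketch of the checkpoint-vs-pretrained comparison via trainer.validate().
# Paths and the Hugging Face model id are placeholders, not confirmed values.
from deepforest import main

# Newly trained detector
checkpoint_model = main.deepforest.load_from_checkpoint("bird_detector.ckpt")  # hypothetical
checkpoint_model.create_trainer()
checkpoint_results = checkpoint_model.trainer.validate(checkpoint_model)

# Previous release weights
pretrained_model = main.deepforest()
pretrained_model.load_model("weecology/deepforest-bird")  # assumed model id
pretrained_model.create_trainer()
pretrained_results = pretrained_model.trainer.validate(pretrained_model)
```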

Still to do is a zero-shot comparison with a new dataset; I am asking the community for at least a couple of images.

@bw4sz bw4sz force-pushed the bird_training branch 2 times, most recently from 245cf82 to 66d57f4 on December 30, 2025 at 18:39