Fix memory leak in RL training loop causing OOM at ~220GB by Mr-Neutr0n · Pull Request #483 · microsoft/agent-lightning

Mr-Neutr0n · 2026-02-09T22:02:37Z

Summary

Fixes #438

During RL training, system memory usage grows continuously until it reaches ~220GB and the run crashes with an OOM. This PR addresses multiple memory leak sources across the store, daemon, trainer, and span exporter:

InMemoryLightningStore: Added cleanup_finished_rollouts() method that removes completed rollout data (rollouts, attempts, spans, sequence IDs) and their associated tracking metadata (_completion_events, _start_time_by_rollout, _span_bytes_by_rollout, _running_rollout_ids, _evicted_rollout_span_sets) from all in-memory data structures after training data has been extracted.
AgentModeDaemon.clear_data_and_server(): Now invokes store cleanup after clearing internal rollout tracking dicts, preventing rollout data from accumulating in the in-memory store across training steps.
AgentLightningTrainer._train_step(): Added explicit del gen_batch after training data extraction, and gc.collect() + torch.cuda.empty_cache() at the end of each training step to release batch tensors and GPU cache.
LightningSpanExporter: Added MAX_BUFFER_SIZE (10000) cap on the span buffer to prevent unbounded growth when spans fail to flush due to missing headers or unavailable store.
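
The store-side cleanup described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `TinyStore`, its field names, and the status strings in `FINISHED` are hypothetical stand-ins. Only the method name `cleanup_finished_rollouts` and the idea of removing a finished rollout together with its per-rollout tracking metadata come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Hypothetical terminal states; the real store may use different names.
FINISHED = {"succeeded", "failed", "cancelled"}


@dataclass
class TinyStore:
    """Toy in-memory store; all field names are illustrative only."""

    status_by_rollout: Dict[str, str] = field(default_factory=dict)
    spans_by_rollout: Dict[str, List[dict]] = field(default_factory=dict)
    start_time_by_rollout: Dict[str, float] = field(default_factory=dict)
    span_bytes_by_rollout: Dict[str, int] = field(default_factory=dict)

    def cleanup_finished_rollouts(self, rollout_ids: Optional[Set[str]] = None) -> int:
        """Drop finished rollouts and their entries in every side table.

        If rollout_ids is given, only those rollouts are considered.
        Returns the number of rollouts removed.
        """
        if rollout_ids is None:
            candidates = list(self.status_by_rollout)
        else:
            candidates = list(rollout_ids)

        removed = 0
        for rid in candidates:
            if self.status_by_rollout.get(rid) not in FINISHED:
                continue  # still running, or unknown to the store: keep it
            del self.status_by_rollout[rid]
            # pop(..., None) so a missing side-table entry is not an error
            self.spans_by_rollout.pop(rid, None)
            self.start_time_by_rollout.pop(rid, None)
            self.span_bytes_by_rollout.pop(rid, None)
            removed += 1
        return removed
```

The point of the sketch is that every dictionary keyed by rollout ID has to be cleared in the same pass; leaving any one of them behind keeps the leak alive, just at a slower rate.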

Test plan

Run multi-step RL training and monitor memory usage via htop or /proc/meminfo; memory should stabilize rather than grow linearly
Verify training metrics remain unchanged (cleanup happens after data extraction)
Test with both v0 and v1 daemon modes to ensure no regressions
Verify span exporter buffer cap works by checking warning logs when buffer fills
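
For the first item in the test plan, a small helper can turn /proc/meminfo readings into a pass/fail signal instead of relying on watching htop. The sketch below is an assumption about how one might do this, not part of the PR: `parse_meminfo`, `used_kib`, and `is_stable` are hypothetical names, and the stability check is a deliberately crude first-versus-last comparison.

```python
import re
from typing import Dict, List


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field_name: value_in_kiB}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        # Matches lines such as "MemTotal:       1000 kB".
        # Fields with parentheses, e.g. "Active(anon)", are skipped.
        match = re.match(r"^(\w+):\s+(\d+)", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def used_kib(info: Dict[str, int]) -> int:
    """Memory in use, estimated as MemTotal minus MemAvailable."""
    return info["MemTotal"] - info["MemAvailable"]


def is_stable(samples: List[int], tolerance_kib: int) -> bool:
    """True if usage grew by at most tolerance_kib from first to last sample."""
    if len(samples) < 2:
        return True
    return samples[-1] - samples[0] <= tolerance_kib
```

One way to use it: read `/proc/meminfo` once at the end of each training step, append `used_kib(parse_meminfo(text))` to a list, and call `is_stable` after a few dozen steps with a tolerance that allows for normal warm-up growth.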

Multiple memory leak sources were identified and fixed:

1. InMemoryLightningStore: Added cleanup_finished_rollouts() method that removes completed rollout data (rollouts, attempts, spans, sequence IDs) and their associated tracking metadata (_completion_events, _start_time_by_rollout, _span_bytes_by_rollout, _running_rollout_ids, _evicted_rollout_span_sets) from all in-memory data structures.
2. AgentModeDaemon.clear_data_and_server(): Now invokes store cleanup after extracting training data, preventing rollout data from accumulating across training steps.
3. AgentLightningTrainer._train_step(): Added explicit deletion of gen_batch after training data extraction, and gc.collect() + torch.cuda.empty_cache() at the end of each training step to release batch tensors.
4. LightningSpanExporter: Added MAX_BUFFER_SIZE (10000) cap on the span buffer to prevent unbounded growth when spans fail to flush due to missing headers or unavailable store.

Fixes microsoft#438
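
The buffer cap on the span exporter can be illustrated with a bounded queue. This is a minimal sketch under stated assumptions: `BoundedSpanBuffer` is a hypothetical class, and evicting the oldest span on overflow is an assumption, since the description does not say whether old or new spans are dropped. The `MAX_BUFFER_SIZE` value of 10000 and the warning-on-full behavior are taken from the description and test plan.

```python
import logging
from collections import deque
from typing import Deque, List

logger = logging.getLogger(__name__)

MAX_BUFFER_SIZE = 10000  # value taken from the PR description


class BoundedSpanBuffer:
    """Hypothetical span buffer that never holds more than max_size items."""

    def __init__(self, max_size: int = MAX_BUFFER_SIZE) -> None:
        # deque(maxlen=...) discards from the left when a full deque is appended to
        self._spans: Deque[object] = deque(maxlen=max_size)
        self.dropped = 0

    def add(self, span: object) -> None:
        if len(self._spans) == self._spans.maxlen:
            # Assumption: evict the oldest span; the PR does not specify which end.
            self.dropped += 1
            logger.warning(
                "Span buffer full (%d); dropping oldest span", self._spans.maxlen
            )
        self._spans.append(span)

    def drain(self) -> List[object]:
        """Return all buffered spans and empty the buffer (e.g. after a flush)."""
        spans = list(self._spans)
        self._spans.clear()
        return spans

    def __len__(self) -> int:
        return len(self._spans)
```

With a cap in place, a store that stays unavailable costs at most `max_size` spans of memory; the trade-off is that spans beyond the cap are lost, which is why the warning log matters for the last item in the test plan.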

Mr-Neutr0n · 2026-02-12T18:11:44Z

Friendly bump! Let me know if there's anything I should update or improve to help move this forward.

Mr-Neutr0n wants to merge 1 commit into microsoft:main from Mr-Neutr0n:fix/rl-training-memory-leak


1 participant
