@Martichou
Summary

When scraping many URLs continuously, browser contexts accumulate in memory and are never cleaned up. The existing cleanup mechanism only runs when browsers go idle, which never happens under continuous load. This causes memory to grow unbounded until the process crashes or becomes unresponsive.

Fixes #943

Small note: I'm not used to Python and, I won't lie, Claude helped me a bit here. I've reviewed what it did and tested it, so this is not just yet another piece of AI slop :)

List of files changed and why

  • browser_manager.py: Add _context_refcounts tracking, cleanup_contexts(), and release_context() methods
  • async_crawler_strategy.py: Release context ref in finally block after crawl
  • deploy/docker/api.py: Trigger context cleanup after each request
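
For readers unfamiliar with the pattern, here is a minimal sketch of reference-counted context tracking with a release in a `finally` block. It is a toy model, not the code in this PR: `ContextPool`, `acquire`, `release`, `cleanup`, and `crawl` are hypothetical names, and a plain `object()` stands in for a real browser context.

```python
import asyncio
from collections import defaultdict


class ContextPool:
    """Toy stand-in for a browser manager (hypothetical, not crawl4ai's class)."""

    def __init__(self):
        self.contexts = {}                 # signature -> context object
        self.refcounts = defaultdict(int)  # signature -> number of active crawls
        self._lock = asyncio.Lock()

    async def acquire(self, signature):
        async with self._lock:
            if signature not in self.contexts:
                self.contexts[signature] = object()  # placeholder for a real context
            self.refcounts[signature] += 1
            return self.contexts[signature]

    async def release(self, signature):
        async with self._lock:
            # Clamp at zero so an unmatched release cannot go negative.
            self.refcounts[signature] = max(0, self.refcounts[signature] - 1)

    async def cleanup(self):
        """Drop every context that no crawl currently holds; return how many."""
        async with self._lock:
            idle = [s for s in self.contexts if self.refcounts[s] == 0]
            for s in idle:
                del self.contexts[s]
                self.refcounts.pop(s, None)
            return len(idle)


async def crawl(pool, signature):
    await pool.acquire(signature)
    try:
        await asyncio.sleep(0)  # the real crawl would happen here
    finally:
        await pool.release(signature)  # always release, even if the crawl fails


async def demo():
    pool = ContextPool()
    # Nine crawls spread over three distinct config signatures.
    await asyncio.gather(*(crawl(pool, f"cfg-{i % 3}") for i in range(9)))
    closed = await pool.cleanup()
    return closed, len(pool.contexts)
```

Because cleanup only looks at the refcount, it can run after every request instead of waiting for the browser to go idle, which is the point of the change.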

How Has This Been Tested?

This was tested locally by running the script below against both the master version and the patched version, each started through Docker Compose, and comparing memory usage before and after.

The script simply performs 100 scrapes with a concurrency of 8 and reports the distribution of status codes:
https://gist.github.com/Martichou/27555055d130d1c65f6a8457fbeb2a22
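
The gist is the authoritative script. As a rough illustration of the same shape (bounded concurrency plus a status-code tally), a sketch could look like the following. `run_load_test` and `fake_fetch` are hypothetical names, and `fake_fetch` replaces the real HTTP call to the crawl endpoint so the example runs without a server.

```python
import asyncio
from collections import Counter


async def run_load_test(fetch, urls, concurrency=8):
    """Call fetch(url) for every URL, at most `concurrency` at a time,
    and return a Counter of the status codes observed."""
    semaphore = asyncio.Semaphore(concurrency)
    statuses = Counter()

    async def worker(url):
        async with semaphore:
            try:
                statuses[await fetch(url)] += 1
            except Exception:
                statuses["error"] += 1  # count failures instead of aborting the run

    await asyncio.gather(*(worker(u) for u in urls))
    return statuses


async def fake_fetch(url):
    # Hypothetical stand-in for an HTTP request to the crawl endpoint.
    await asyncio.sleep(0)
    return 500 if url.endswith("/13") else 200


def demo():
    urls = [f"https://example.com/{i}" for i in range(100)]
    return asyncio.run(run_load_test(fake_fetch, urls))
```

Swapping `fake_fetch` for a coroutine that posts to the deployed API would turn this into a real load test.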

Result of the test:

Unpatched version:

Baseline memory usage: 4.5%
End of first test run using unpatched version: 23.4%
End of second test run using unpatched version: 27.6%
End of third test run using unpatched version: 32.8%

Patched version:

Baseline memory usage: 5.7%
End of first test run using patched version: 11.2%
End of second test run using patched version: 12.3%
End of third test run using patched version: 13.4%

It may not have eliminated every leak (memory still grows by about 1% between runs, for an unknown reason), but closing the browser through the kill-browser endpoint brings memory back down to 10%.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ntohidi ntohidi changed the base branch from main to develop November 26, 2025 08:18
aravindkarnam and others added 4 commits December 23, 2025 16:28
When scraping many URLs continuously, browser contexts accumulated in
memory and were never cleaned up. The existing cleanup only ran when
browsers went idle, which never happened under continuous load.
See: unclecode#943.

Key changes:
- browser_manager.py: Add _context_refcounts tracking, cleanup_contexts(),
  and release_context() methods
- async_crawler_strategy.py: Release context ref in finally block after crawl
- deploy/docker/api.py: Trigger context cleanup after each request

This fixes, or at least drastically reduces, the memory leaks in my testing.
Copilot AI review requested due to automatic review settings January 1, 2026 18:48
Copilot AI left a comment

Pull request overview

This pull request addresses a memory leak issue where browser contexts accumulate in memory and are never cleaned up under continuous load. The fix introduces reference counting for contexts and adds periodic cleanup mechanisms.

  • Implements reference counting to track active usage of browser contexts
  • Adds cleanup_contexts() method to periodically close idle contexts
  • Triggers context cleanup after each API request to prevent unbounded memory growth

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| crawl4ai/browser_manager.py | Adds reference counting system (_context_refcounts), cleanup_contexts() method for closing idle contexts, and release_context() method for decrementing refcounts |
| crawl4ai/async_crawler_strategy.py | Adds release_context() call in finally block to decrement refcount when crawl completes |
| deploy/docker/api.py | Triggers cleanup_contexts() after each request to limit context accumulation, with whitespace cleanup |
| README.md | Adds new sponsor (Thor Data) - unrelated to memory leak fix |

Comment on lines +1056 to +1062
# Release the context reference so cleanup can work
if not self.browser_config.use_managed_browser:
    try:
        config_signature = self.browser_manager._make_config_signature(config)
        await self.browser_manager.release_context(config_signature)
    except Exception:
        pass  # Don't fail on cleanup
Copilot AI Jan 1, 2026

The release_context call here creates a reference counting imbalance when using session_id. Looking at browser_manager.py get_page(), when a session_id is provided and already exists, the function returns early (line 1063-1066) without incrementing the refcount. However, this release_context will still be called, decrementing a counter that was never incremented. This will cause the refcount to go negative (though clamped to 0 by the max() call in release_context), potentially allowing contexts to be cleaned up while still in use by sessions. The condition should also check that no session_id is being used, similar to: if not self.browser_config.use_managed_browser and not config.session_id:
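
To make the imbalance concrete, here is a small toy model of the pairing described above. It is a sketch under assumptions, not the PR's code: `RefTracker`, `get_page`, `release`, and `crawl` are hypothetical simplifications, and the `guard_sessions` flag stands in for the suggested `not config.session_id` check. In this model a session keeps its reference until it is torn down elsewhere.

```python
class RefTracker:
    """Toy model of the acquire/release pairing (hypothetical names)."""

    def __init__(self):
        self.refcounts = {}
        self.sessions = set()

    def get_page(self, signature, session_id=None):
        if session_id is not None and session_id in self.sessions:
            return  # early return: existing session, refcount untouched
        if session_id is not None:
            self.sessions.add(session_id)
        self.refcounts[signature] = self.refcounts.get(signature, 0) + 1

    def release(self, signature):
        # Clamped at zero, mirroring the max() call mentioned in the review.
        self.refcounts[signature] = max(0, self.refcounts.get(signature, 0) - 1)


def crawl(tracker, signature, session_id=None, guard_sessions=False):
    tracker.get_page(signature, session_id)
    # ... crawl work would happen here ...
    if guard_sessions and session_id is not None:
        return  # session-backed pages keep their reference
    tracker.release(signature)


def in_flight_refcount(guard_sessions):
    tracker = RefTracker()
    tracker.get_page("cfg")  # crawl A is in flight and holds one reference
    crawl(tracker, "cfg", "s1", guard_sessions)  # creates session s1
    crawl(tracker, "cfg", "s1", guard_sessions)  # reuses session s1
    return tracker.refcounts["cfg"]
```

Without the guard, the second session crawl releases a reference it never took, so the count reaches zero while crawl A is still running and cleanup would be free to close its context.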

Comment on lines +1206 to +1209
# If force=True and we still have too many, close active ones too
if force and len(self.contexts_by_config) - len(contexts_to_close) > max_contexts:
    remaining_excess = len(self.contexts_by_config) - len(contexts_to_close) - max_contexts
    contexts_to_close.extend(active_contexts[:remaining_excess])
Copilot AI Jan 1, 2026

The documentation states that force will "close contexts even if they have pages (but never if refcount > 0)", but the implementation at lines 1207-1209 will actually close active contexts when force=True, and active_contexts includes contexts with refcount > 0 (added at line 1171-1173). This means force=True can close contexts that are actively being used by requests, contradicting the docstring and potentially causing "Target closed" errors during active crawls. The condition should filter out contexts with refcount > 0 from active_contexts before extending contexts_to_close.

Suggested change
- # If force=True and we still have too many, close active ones too
- if force and len(self.contexts_by_config) - len(contexts_to_close) > max_contexts:
-     remaining_excess = len(self.contexts_by_config) - len(contexts_to_close) - max_contexts
-     contexts_to_close.extend(active_contexts[:remaining_excess])
+ # If force=True and we still have too many, close additional contexts
+ # but never close contexts with refcount > 0 (they may be in active use).
+ if force and len(self.contexts_by_config) - len(contexts_to_close) > max_contexts:
+     remaining_excess = len(self.contexts_by_config) - len(contexts_to_close) - max_contexts
+     # From active_contexts, only consider those whose refcount is 0 for forced closure
+     force_closable_active = [
+         (sig, ctx)
+         for sig, ctx in active_contexts
+         if self._context_refcounts.get(sig, 0) == 0
+     ]
+     contexts_to_close.extend(force_closable_active[:remaining_excess])
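
The selection rule in this suggestion can also be tried in isolation. The standalone function below is a sketch, not the method from the PR: `select_contexts_to_close` and its parameters are hypothetical, and it assumes that a context counts as active when it has either open pages or a refcount above zero, and that every idle context is always closed.

```python
def select_contexts_to_close(signatures, refcounts, page_counts, max_contexts, force=False):
    """Return the signatures that would be closed (hypothetical helper)."""
    idle, active = [], []
    for sig in signatures:
        in_use = refcounts.get(sig, 0) > 0 or page_counts.get(sig, 0) > 0
        (active if in_use else idle).append(sig)

    to_close = list(idle)  # assumption: idle contexts are always closed
    excess = len(signatures) - len(to_close) - max_contexts
    if force and excess > 0:
        # Contexts with open pages but no in-flight crawl may be closed;
        # anything with refcount > 0 is left alone.
        closable = [sig for sig in active if refcounts.get(sig, 0) == 0]
        to_close.extend(closable[:excess])
    return to_close


signatures = ["a", "b", "c", "d"]
refcounts = {"a": 1, "b": 0, "c": 0, "d": 0}
page_counts = {"a": 1, "b": 2, "c": 1, "d": 0}

without_force = select_contexts_to_close(signatures, refcounts, page_counts, max_contexts=1)
with_force = select_contexts_to_close(signatures, refcounts, page_counts, max_contexts=1, force=True)
```

Here context `a` has an in-flight crawl, so it survives even with `force=True`, while `b` and `c` (open pages, refcount 0) are eligible for forced closure.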

Comment on lines +1219 to +1220
except Exception:
    pass
Copilot AI Jan 1, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
- except Exception:
-     pass
+ except Exception as e:
+     # Ignore individual page close failures but record them for diagnostics
+     self.logger.warning(
+         message="Error closing page during context cleanup: {error}",
+         tag="WARNING",
+         params={"error": str(e)}
+     )

3 participants