feat: implement per-domain request throttling (ThrottlingRequestManager)#3741

Open
harryautomazione wants to merge 2 commits into apify:v4 from harryautomazione:v4
@harryautomazione

Summary

This PR implements per-domain request rate-limiting and delay logic at the RequestManager layer. It ports the design from Python's Crawlee (PR #1762) to the TypeScript v4 branch, addressing the architectural feedback from PR #3737.

Context & Rationale

Previously, an HTTP 429 response triggered a SessionError, which immediately retired the session and rotated the proxy. This led to high IP churn and fast session depletion.
Moving the throttling to the request manager layer has the following effects:

  • The crawler now respects Retry-After headers and applies exponential backoff delays only to the affected domains.
  • Active sessions/proxies are preserved during transient rate limits, preventing session burning.
  • The wrapper is completely opt-in and backward-compatible.
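As a rough illustration of the delay logic described above, the sketch below parses a Retry-After value (delta-seconds or HTTP-date) and falls back to capped exponential backoff when the header is absent or unparseable. The helper names (`parseRetryAfterMs`, `backoffDelayMs`, `throttleDelayMs`) and the default base/cap values are assumptions made for this example; they are not taken from the PR's code.

```typescript
// Hypothetical helpers: names and default values are illustrative only.

/** Parse a Retry-After header value (delta-seconds or HTTP-date) into milliseconds. */
function parseRetryAfterMs(value: string | undefined, nowMs: number = Date.now()): number | undefined {
    if (value === undefined) return undefined;
    const trimmed = value.trim();
    if (trimmed === '') return undefined;
    // Delta-seconds form, e.g. "120".
    if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
    // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    const dateMs = Date.parse(trimmed);
    if (Number.isNaN(dateMs)) return undefined;
    // A date in the past means no further waiting is needed.
    return Math.max(0, dateMs - nowMs);
}

/** Exponential backoff: baseMs * 2^(consecutive429s - 1), capped at maxMs. */
function backoffDelayMs(consecutive429s: number, baseMs: number = 1000, maxMs: number = 60000): number {
    const exponent = Math.max(0, consecutive429s - 1);
    return Math.min(maxMs, baseMs * Math.pow(2, exponent));
}

/** Prefer the server-provided Retry-After; otherwise fall back to exponential backoff. */
function throttleDelayMs(
    retryAfter: string | undefined,
    consecutive429s: number,
    nowMs: number = Date.now(),
): number {
    const fromHeader = parseRetryAfterMs(retryAfter, nowMs);
    return fromHeader !== undefined ? fromHeader : backoffDelayMs(consecutive429s);
}
```

Preferring the header over the computed backoff keeps the crawler aligned with what the server asked for, while the cap prevents a long run of 429s from stalling a domain indefinitely.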

Key Changes

  • ThrottlingRequestManager (packages/core/src/storages/throttling_request_manager.ts): A new storage wrapper that implements IRequestManager. It routes domain-specific requests to sub-managers, manages backoff delays, and runs a wake/sleep loop in fetchNextRequest to avoid busy waiting.
  • Crawler Interception:
    • HttpCrawler & BrowserCrawler: Intercepted HTTP 429 responses, extracted the Retry-After header, registered the delay with ThrottlingRequestManager, and forced a standard retry instead of retiring the session.
    • BasicCrawler: Integrated with RobotsTxtFile to register the crawl-delay when respectRobotsTxtFile is active.
  • RobotsTxtFile (packages/utils/src/internals/robots.ts): Exposed getCrawlDelay from robots-parser.
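To make the wake/sleep idea concrete, here is a minimal, self-contained sketch of a per-domain throttle: requests are bucketed by hostname, a registered delay blocks one domain, and `fetchNextRequest` sleeps until the earliest blocked domain becomes available instead of polling. The class `DomainThrottleSketch`, its methods, and the injectable clock are hypothetical and written for this example; the actual ThrottlingRequestManager and the IRequestManager interface in the PR will differ.

```typescript
// Hypothetical sketch: not the PR's ThrottlingRequestManager.
interface QueuedRequest {
    url: string;
}

class DomainThrottleSketch {
    private readonly queues = new Map<string, QueuedRequest[]>();
    private readonly blockedUntil = new Map<string, number>();
    private readonly now: () => number;
    private readonly sleep: (ms: number) => Promise<void>;

    // The clock and sleep function are injectable so the loop can be tested without real waiting.
    constructor(
        now: () => number = () => Date.now(),
        sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
    ) {
        this.now = now;
        this.sleep = sleep;
    }

    /** Route the request to the queue of its hostname. */
    addRequest(request: QueuedRequest): void {
        const domain = new URL(request.url).hostname;
        const queue = this.queues.get(domain) ?? [];
        queue.push(request);
        this.queues.set(domain, queue);
    }

    /** Block a domain for delayMs (e.g. from Retry-After or crawl-delay); never shortens an existing block. */
    registerDelay(domain: string, delayMs: number): void {
        const until = this.now() + delayMs;
        this.blockedUntil.set(domain, Math.max(until, this.blockedUntil.get(domain) ?? 0));
    }

    /** Return the next request from an unblocked domain, sleeping if every pending domain is blocked. */
    async fetchNextRequest(): Promise<QueuedRequest | null> {
        while (true) {
            let earliestWake = Infinity;
            for (const [domain, queue] of Array.from(this.queues.entries())) {
                if (queue.length === 0) continue;
                const until = this.blockedUntil.get(domain) ?? 0;
                if (until <= this.now()) return queue.shift() ?? null;
                earliestWake = Math.min(earliestWake, until);
            }
            // Nothing pending at all: report an empty queue.
            if (earliestWake === Infinity) return null;
            // Everything pending is throttled: sleep until the first domain wakes up.
            await this.sleep(earliestWake - this.now());
        }
    }
}
```

Because the loop sleeps exactly until the earliest `blockedUntil` timestamp, a throttled domain does not hold up requests for other domains and the crawler does not spin while waiting.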

Verification Results

  • Unit Tests: Added a test suite at test/core/storages/throttling_request_manager.test.ts covering routing, exponential backoff, header parsing, and crawl-delay settings (5/5 passed).
  • Integration Tests: Added tests in test/core/crawlers/http_crawler.test.ts verifying that HTTP 429 delays are respected and the session is not retired (17/17 passed).
  • Build: The entire workspace builds successfully, and the declaration compilation checks pass for all packages.
