fix(core): respect Retry-After header on 429 without retiring session #3737

Closed
harryautomazione wants to merge 1 commit into apify:master from harryautomazione:fix-429-handling

@harryautomazione

Motivation:
Currently, when a crawler encounters an HTTP 429 Too Many Requests status code, the SessionPool interprets it as an IP block and permanently retires the session, ignoring the standard Retry-After header entirely. This leads to premature session churn and IP exhaustion.

Changes:

  1. Removed 429 from BLOCKED_STATUS_CODES in session_pool/consts.ts so the session isn't automatically retired.
  2. Introduced RateLimitError in errors.ts to capture the retryAfterMs duration.
  3. Intercepted 429s in http-crawler, parsing the Retry-After header (both seconds and date strings) to throw the new RateLimitError.
  4. Added domain cooldown in basic-crawler: Caught RateLimitError inside _requestFunctionErrorHandler to automatically update domainAccessedTime. This elegantly delays the next request to that domain by the specified cooldown without burning the session.
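
Items 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `parseRetryAfterMs` and the exact shape of `RateLimitError` are assumptions. It only relies on the fact that Retry-After carries either a number of seconds or an HTTP date.

```typescript
// Hypothetical error type; the real RateLimitError in errors.ts may differ.
class RateLimitError extends Error {
    retryAfterMs: number | undefined;

    constructor(retryAfterMs: number | undefined) {
        super(`Rate limited (HTTP 429), retry after ${retryAfterMs ?? 'unknown'} ms`);
        this.name = 'RateLimitError';
        this.retryAfterMs = retryAfterMs;
    }
}

// Hypothetical helper: converts a Retry-After header value to milliseconds.
// Returns undefined when the header is missing or cannot be interpreted.
function parseRetryAfterMs(
    headerValue: string | undefined,
    nowMs: number = Date.now(),
): number | undefined {
    if (headerValue === undefined) return undefined;

    const value = headerValue.trim();
    if (value === '') return undefined;

    // Form 1: delay in seconds, a non-negative integer such as "120".
    if (/^\d+$/.test(value)) return Number(value) * 1000;

    // Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
    // Date.parse is lenient, so a stricter parser may be preferable in practice.
    const dateMs = Date.parse(value);
    if (Number.isNaN(dateMs)) return undefined;

    // A date in the past means no further waiting is needed.
    return Math.max(0, dateMs - nowMs);
}
```

A crawler could then throw `new RateLimitError(parseRetryAfterMs(header))` when it sees a 429, leaving the decision on how long to wait (and what default to use when the header is absent) to the caller.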

Testing:

  • Added a new unit test in http_crawler.test.ts (should respect 429 RateLimitError and retry). It creates a mock endpoint that returns a 429 with Retry-After: 1 on the first hit, and 200 OK on the second. The test verifies that the crawler waits for the cooldown (~1000ms) and successfully retries.
  • Verified that all other tests (vitest run test/core/crawlers/http_crawler.test.ts) are still passing successfully.

Closes #3623

@barjin barjin left a comment

Member

Hello, and thank you for your interest in this project, @harryautomazione!

The BasicCrawler.delayRequest method is actually deprecated, and we're planning to redesign it in v4. The delays should be a concern of the RequestManager, the same as in apify/crawlee-python#1762.

If you want to give it a try and model the JS solution after the Python implementation, you are very much welcome to do so. Please note the PR with these changes should target the v4 branch.

Cheers!

@harryautomazione

Author

Hi @barjin,

I am closing this PR in favor of the new one targeting the v4 branch: #3741.

The new PR implements the per-domain request rate-limiting and delay logic at the RequestManager layer via ThrottlingRequestManager, in accordance with the feedback.
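
For readers unfamiliar with the idea, per-domain cooldown tracking can be sketched as below. This is only an illustration of the general technique: the class `DomainCooldowns` and its methods are hypothetical and are not the ThrottlingRequestManager API from #3741.

```typescript
// Hypothetical per-domain cooldown tracker, not the actual v4 implementation.
class DomainCooldowns {
    // Maps a hostname to the earliest timestamp (ms) at which it may be hit again.
    private readonly notBefore = new Map<string, number>();

    // Record a rate-limit response for the URL's host.
    recordRateLimit(url: string, retryAfterMs: number, nowMs: number = Date.now()): void {
        const host = new URL(url).hostname;
        const candidate = nowMs + retryAfterMs;
        const current = this.notBefore.get(host) ?? 0;
        // Never shorten an existing, longer cooldown.
        this.notBefore.set(host, Math.max(current, candidate));
    }

    // How long a request to this URL's host should still wait, in ms.
    remainingMs(url: string, nowMs: number = Date.now()): number {
        const host = new URL(url).hostname;
        return Math.max(0, (this.notBefore.get(host) ?? 0) - nowMs);
    }
}
```

A request manager built along these lines would consult `remainingMs` before handing out a request and hold back requests for a host that is still cooling down, while requests to other hosts proceed normally.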

Thank you for your guidance!

Development

Successfully merging this pull request may close these issues.

HTTP 429 responses retire sessions and immediately retry reclaimed requests instead of applying cooldown
