Skip to content

feat: add is_finished method to RequestQueueClient interface#1982

Open
Mantisus wants to merge 6 commits into
apify:masterfrom
Mantisus:queue-client-is-finished
Open

feat: add is_finished method to RequestQueueClient interface#1982
Mantisus wants to merge 6 commits into
apify:masterfrom
Mantisus:queue-client-is-finished

Conversation

@Mantisus

Copy link
Copy Markdown
Collaborator

Description

  • Adds an is_finished method to the RequestQueueClient interface. It's not a breaking change. The base class defaults is_finished to is_empty when it isn't overridden. Splits the queue predicates. is_empty means there are no requests available to fetch. is_finished means all requests are fully processed (nothing to fetch and nothing in progress).
  • Implements is_finished in all built-in RequestQueueClient.
  • Updates is_empty in all built-in RequestQueueClient to match the new meaning.
  • Affects the AutoscaledPool. It now skips scheduling new tasks when no requests are available to fetch. For example, while the only request is being processed.

Issues

Testing

  • Added a test for RequestQueue that checks the is_empty and is_finished values through the queue lifecycle. It covers all built-in clients.
  • Added a regression test that checks the AutoscaledPool doesn't schedule new tasks while the only request is being processed.

@Mantisus Mantisus requested review from janbuchar and vdusek June 21, 2026 21:23
@Mantisus Mantisus self-assigned this Jun 21, 2026

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sound, the only thing I'm mildly concerned about is the removal of the is_empty return value caching — I'd feel better if I knew some historical context and why it's safe to do this.

Comment on lines -589 to -608
# If we have a cached value, return it immediately.
if self._is_empty_cache is not None:
return self._is_empty_cache

state = self._state.current_value

# If there are in-progress requests, return False immediately.
if len(state.in_progress_requests) > 0:
self._is_empty_cache = False
return False

# If we have a cached requests, check them first (fast path).
if self._request_cache:
for req in self._request_cache:
if req.unique_key not in state.handled_requests:
self._is_empty_cache = False
return False
self._is_empty_cache = True
return len(state.in_progress_requests) == 0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure it's safe to remove all this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. I was sure that, in the current implementation of atomic_write, it guarantees that the metadata file is always valid. However, I just double-checked this, and on Windows, there is still a risk of creating an invalid metadata file that would cause the queue to fail.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with cache logic

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you please implement apify/apify-sdk-python#987 as well?

That way, we can verify that it works on the platform too.

I'd like this to be part of SDK v4, built on top of Crawlee v1.8 with this included.

@Mantisus

Mantisus commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks! Could you please implement apify/apify-sdk-python#987 as well?

Sure.

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks about right to me. @vdusek please review it too 🙂

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments

Comment thread src/crawlee/storage_clients/_base/_request_queue_client.py
Comment thread src/crawlee/storage_clients/_redis/_request_queue_client.py Outdated
Comment thread src/crawlee/storage_clients/_file_system/_request_queue_client.py Outdated
Comment thread src/crawlee/storage_clients/_memory/_request_queue_client.py Outdated
Comment thread tests/unit/crawlers/_basic/test_basic_crawler.py Outdated
Mantisus and others added 3 commits June 23, 2026 14:35
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
@Mantisus Mantisus requested a review from vdusek June 23, 2026 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add an is_finished method to the StorageClient interface Crawler can get softlocked when consuming the RequestQueue

4 participants