Exclude decommissioning nodes when opening new shards #6165

Open
ncoiffier-celonis wants to merge 8 commits into quickwit-oss:main from ncoiffier-celonis:fix-ingestion-gap-when-decomissioning-node
ncoiffier-celonis commented Feb 20, 2026

Description

Attempt to fix #6158

This PR:

  • broadcasts the ingester status through chitchat
  • enriches the ControlPlaneModel to maintain a list of decommissioning indexers
  • filters out decommissioning nodes when opening new shards, rebalancing, or scaling up shards
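The filtering step above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `IngesterStatus`, `ControlPlaneModelSketch`, `update_status`, and `eligible_for_new_shards` are hypothetical names, and the gossip layer is reduced to a plain method call.

```rust
use std::collections::BTreeSet;

/// Hypothetical ingester status, as it might be gossiped between nodes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum IngesterStatus {
    Ready,
    Decommissioning,
}

/// Hypothetical, simplified stand-in for the control plane's view of the cluster.
#[derive(Default)]
struct ControlPlaneModelSketch {
    /// IDs of the indexers currently known to be decommissioning.
    decommissioning_indexers: BTreeSet<String>,
}

impl ControlPlaneModelSketch {
    /// Records a status update received through gossip.
    fn update_status(&mut self, node_id: &str, status: IngesterStatus) {
        match status {
            IngesterStatus::Decommissioning => {
                self.decommissioning_indexers.insert(node_id.to_string());
            }
            IngesterStatus::Ready => {
                self.decommissioning_indexers.remove(node_id);
            }
        }
    }

    /// Returns the candidates that may receive new shards, in their original
    /// order, skipping any node known to be decommissioning.
    fn eligible_for_new_shards<'a>(&self, candidates: &[&'a str]) -> Vec<&'a str> {
        candidates
            .iter()
            .copied()
            .filter(|node_id| !self.decommissioning_indexers.contains(*node_id))
            .collect()
    }
}
```

In this sketch, the same `eligible_for_new_shards` filter would be applied wherever a shard placement decision is made (opening, rebalancing, scaling up), so a node is excluded as soon as its status update has been recorded.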

Even with this approach and some propagation delay before decommissioning, ingestion of some documents can still fail if chitchat takes longer than expected to gossip the ingester state to the control plane.

I am wondering whether this could conflict with the approach implemented in #6163, though.

Any feedback is welcome!!

How was this PR tested?

In addition to the unit and integration tests, I ran it against a local cluster with 2 indexers and observed that the number of errors reported in #6158 decreased from a few hundred to fewer than 10.

Other considerations

I also considered these 3 alternatives:

  • re-using the indexer state (i.e. READY/NOT_READY, with an added DRAINING state), but an indexer needs to be ready to successfully complete the decommission process
  • using the shard status itself in the decommissioning routine, but the changes were much more "spaghetti", and I couldn't quite get them working
  • using a gRPC call from the indexer to the control plane when decommissioning, but this doesn't seem to fit the rest of the codebase and doesn't seem robust to failures and control-plane restarts

If we want to de-risk this change, we could put it behind a feature flag or config property.
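As a rough idea of what such a gate could look like, here is a minimal sketch. The `ShardPlacementConfig` struct, its `exclude_decommissioning_nodes` field, and the `shard_candidates` function are hypothetical and do not exist in the codebase; they only illustrate keeping the previous behavior when the flag is off.

```rust
/// Hypothetical config property; the name is illustrative only.
#[derive(Clone, Copy, Debug)]
struct ShardPlacementConfig {
    exclude_decommissioning_nodes: bool,
}

/// Returns the nodes that may receive new shards.
/// With the flag disabled, every node is returned, as before this change.
fn shard_candidates<'a>(
    config: ShardPlacementConfig,
    all_nodes: &[&'a str],
    decommissioning: &[&str],
) -> Vec<&'a str> {
    all_nodes
        .iter()
        .copied()
        .filter(|node| {
            // Keep the node if the feature is off, or if it is not decommissioning.
            !config.exclude_decommissioning_nodes
                || !decommissioning.iter().any(|d| *d == *node)
        })
        .collect()
}
```

Defaulting the flag to off would make the change opt-in while it is being validated.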

Development

Successfully merging this pull request may close these issues.

Indexer graceful shutdown causes ingestion gap and 500 errors "no shards available"
