schema agreement wait is blocked on down nodes with WhiteListRoundRobinPolicy

Can be observed in dtest cluster_replacement_test.py::TestClusterReplacement::test_rack_loss_recovery, which does this:

1. down two nodes
2. execute ALTER KEYSPACE in the background
3. execute removenode on the downed nodes

If you comment out step 3, the ALTER times out, because schema agreement wait never completes.

There is a check in the driver that is supposed to skip down nodes:

```python
            if peer and peer.is_up is not False:
                versions[schema_ver].add(endpoint)
``` 

but with WhiteListRoundRobinPolicy, all nodes except the node of the connection are ignored by the policy's distance() function, so down and up events are ignored for those nodes. As a result, all peers except the contact node have is_up == None, which passes the `is not False` check, and schema agreement waits for the down nodes to catch up (which they won't).
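To make the failure mode concrete, here is a minimal standalone sketch that applies the same filter to a toy cluster. It does not use the driver: `Host`, `group_schema_versions`, `in_agreement`, the addresses, and the version strings are all hypothetical names invented for illustration, and treating "agreement" as "at most one distinct version" is an assumption about what the wait checks.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    """Hypothetical stand-in for the driver's host object."""
    endpoint: str
    is_up: Optional[bool] = None  # None: no up/down event was ever applied


def group_schema_versions(rows, hosts):
    """Group endpoints by schema version, using the filter quoted above."""
    versions = defaultdict(set)
    for endpoint, schema_ver in rows:
        peer = hosts.get(endpoint)
        if peer and peer.is_up is not False:
            versions[schema_ver].add(endpoint)
    return dict(versions)


def in_agreement(versions):
    """Assumption: agreement means at most one distinct schema version."""
    return len(versions) <= 1


# The contact node has the new schema; two down nodes still report the old one.
rows = [("10.0.0.1", "v2"), ("10.0.0.2", "v1"), ("10.0.0.3", "v1")]

# Case 1: events for ignored nodes were dropped, so is_up stays None.
ignored = {
    "10.0.0.1": Host("10.0.0.1", is_up=True),
    "10.0.0.2": Host("10.0.0.2"),
    "10.0.0.3": Host("10.0.0.3"),
}

# Case 2: the down events were applied, so is_up is False.
tracked = {
    "10.0.0.1": Host("10.0.0.1", is_up=True),
    "10.0.0.2": Host("10.0.0.2", is_up=False),
    "10.0.0.3": Host("10.0.0.3", is_up=False),
}

print(in_agreement(group_schema_versions(rows, ignored)))  # False
print(in_agreement(group_schema_versions(rows, tracked)))  # True
```

In the first case the stale version of the down nodes is still counted, so agreement can never be reached until those nodes are removed or their state becomes known.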

The problem probably happens with other policies as well, e.g. those that ignore remote DCs.

I think the right fix for schema agreement wait is to do it on the server side. 

