schema agreement wait is blocked on down nodes with WhiteListRoundRobinPolicy #572

@tgrabiec

This can be observed in the dtest cluster_replacement_test.py::TestClusterReplacement::test_rack_loss_recovery, which does the following:

  1. take two nodes down
  2. execute ALTER KEYSPACE in the background
  3. execute removenode on the downed nodes

If you comment out step 3, the ALTER times out, because the schema agreement wait never completes.

There is a check in the driver that is supposed to skip down nodes:

            if peer and peer.is_up is not False:
                versions[schema_ver].add(endpoint)

but with WhiteListRoundRobinPolicy, all nodes except the node of the connection are ignored by the policy's distance() function, so down and up events are ignored for those nodes. As a result, all peers except the contact node have is_up set to None rather than False. They pass the check above, and the schema agreement wait keeps waiting for them to catch up (which they won't).
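To make the failure mode concrete, here is a minimal, self-contained sketch of the check quoted above. It does not use the driver: `Peer`, `group_schema_versions`, `in_agreement`, the addresses, and the version strings are all hypothetical stand-ins, and the only line taken from the issue is the `is_up is not False` condition. It assumes the downed nodes are still listed with their old schema version.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Peer:
    # Hypothetical stand-in for the driver's host object.
    endpoint: str
    schema_ver: str
    is_up: Optional[bool]  # True, False, or None (state never reported)


def group_schema_versions(peers):
    """Group endpoints by schema version, using the check from the issue."""
    versions = defaultdict(set)
    for peer in peers:
        # None is "not False", so untracked peers are counted as if up.
        if peer and peer.is_up is not False:
            versions[peer.schema_ver].add(peer.endpoint)
    return dict(versions)


def in_agreement(peers):
    # Agreement: every counted peer reports the same schema version.
    return len(group_schema_versions(peers)) <= 1


# Contact node is tracked and already on the new schema. The two downed
# nodes were ignored by the policy, so their state was never set to False.
ignored = [
    Peer("10.0.0.1", "v2", True),
    Peer("10.0.0.2", "v1", None),
    Peer("10.0.0.3", "v1", None),
]

# Same cluster, but with the down events actually recorded.
tracked = [
    Peer("10.0.0.1", "v2", True),
    Peer("10.0.0.2", "v1", False),
    Peer("10.0.0.3", "v1", False),
]

print(in_agreement(ignored))
print(in_agreement(tracked))
```

In this model the `ignored` list never reaches agreement because the two stale peers are still counted, while the `tracked` list does, since peers explicitly marked down are skipped.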

The problem probably happens with other policies as well, e.g. those that ignore remote DCs.

I think the right fix for schema agreement wait is to do it on the server side.
