A customer finds multiple fullsync coordinator workers running simultaneously on each of two clusters. This causes multiple fullsync schedules to run concurrently; the actual fullsync operations may or may not overlap, but each coordinator is active and has its own timer.
This state is reproducible as follows:
- Set up two clusters, A & B.
- Set up REPL and connect them (cluster manager 0.0.0.0:9080).
- Set `fullsync_on_connect` to `true` (unclear whether this step is required).
- Push continuous load onto cluster A.
- Start fullsync with A as source and B as sink.
- While fullsync is running, join one or more new nodes to A.
- On all nodes, `riak attach` and run `supervisor:count_children(whereis(riak_repl2_fscoordinator_sup)).`
- Observe that worker count > 0 on more than one node. In my test, it was on the original coordinator and also the newly joined node.
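The per-node check above can also be run once from a single attached node. The following is a minimal sketch rather than a verified procedure: it assumes that `[node() | nodes()]` covers every member of the local cluster, and it passes the registered supervisor name directly to `supervisor:count_children/1` instead of going through `whereis/1`.

```erlang
%% Assumption: the connected Erlang nodes are exactly the cluster members.
Nodes = [node() | nodes()].

%% Ask every node for the child counts of its fullsync coordinator supervisor.
%% A node where the call fails yields {badrpc, Reason} instead of a list.
Counts = [{N, rpc:call(N, supervisor, count_children,
                       [riak_repl2_fscoordinator_sup])} || N <- Nodes].

%% Keep only the nodes that report at least one worker.
WithWorkers = [{N, proplists:get_value(workers, C)}
               || {N, C} <- Counts,
                  is_list(C),
                  proplists:get_value(workers, C, 0) > 0].
```

If `WithWorkers` lists more than one node, the cluster is in the state described here.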
The workaround for this issue is to manually kill all `riak_repl2_fscoordinator_sup` processes as follows:
- stop & disable fullsync
- wait a few minutes
- on each node, attach and run `Pid = whereis(riak_repl2_fscoordinator_sup).` then `erlang:exit(Pid,kill).`
- wait a few minutes
- enable & start fullsync
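The kill step can likewise be issued from one attached node instead of attaching to each node in turn. This is a sketch under the same assumption that `[node() | nodes()]` covers the whole cluster. The `Kill` fun is a hypothetical helper introduced for illustration, not part of riak_repl. Run it only after fullsync has been stopped and disabled, as in the steps above.

```erlang
%% Hypothetical helper: look up the supervisor on node N and kill it.
%% Returns {N, true} on success, or {N, Other} when the process is not
%% registered on that node (undefined) or the rpc call failed.
Kill = fun(N) ->
    case rpc:call(N, erlang, whereis, [riak_repl2_fscoordinator_sup]) of
        Pid when is_pid(Pid) -> {N, exit(Pid, kill)};
        Other -> {N, Other}
    end
end.

%% Assumption: the connected Erlang nodes are exactly the cluster members.
[Kill(N) || N <- [node() | nodes()]].
```

The guard on `is_pid/1` avoids calling `exit/2` with `undefined` on a node where the supervisor is not currently registered.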
The symptoms of this issue are extremely slow fullsync operations, cluster overload / slowness, and fullsync activity in the logs when no fullsync ought to be running.