We have a setup with replicated client-facing bazel-remote instances that write to ephemeral disk and proxy to a single backend bazel-remote instance that writes to NFS.
In the rare case when both the backend and some "frontend" instances restart, the "frontend" instances cannot start quickly enough to serve clients, because their startup depends on a successful gRPC GetCapabilities request to the proxy backend. The backend is meanwhile restoring its in-memory index from disk state, which takes tens of minutes in our case, and the "frontend" instances fail their startup probe during this time.
This creates real downtime for our clients, and it also creates a strange asymmetry: "frontend" instances that were not restarted keep working (failing to proxy requests, but successfully communicating with clients and serving from local disk), while new or restarted instances are simply dead for a significant amount of time.
This feels like a real bug. Proposed solution: make the proxy GetCapabilities check lazy, so that it runs once the proxy backend is back online instead of blocking startup.