Proxy backend failure prevents client-facing bazel-remote instances from starting because of grpc GetCapabilities check

We have a setup with replicated client-facing bazel-remote instances that write to ephemeral disk and proxy to a single backend bazel-remote instance that writes to NFS.

In rare occurrences when both backend and some "frontend" instances restart, "frontend" instances cannot start fast to serve clients as they are dependent on successful Grpc GetCapabilities request to the proxy backend, which is in the process of restoring in-memory index from disk state, taking tens of minutes in our case, failing startup probe during this time.

This creates real downtime for our clients, and also creates weird asymmetry: "frontend" instances that weren't restarted keep working (failing to proxy requests, but successfully communication with clients and serving from local disk), while new/restarted instances are just dead for significant amount of time.

This feels like a real bug. Proposed solution: make proxy GetCapabilities check lazy, that will happen once the proxy gets back online.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proxy backend failure prevents client-facing bazel-remote instances from starting because of grpc GetCapabilities check #896

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Proxy backend failure prevents client-facing bazel-remote instances from starting because of grpc GetCapabilities check #896

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions