Proxy backend failure prevents client-facing bazel-remote instances from starting because of gRPC GetCapabilities check #896

@ofats

We have a setup with replicated client-facing bazel-remote instances that write to ephemeral disk and proxy to a single backend bazel-remote instance that writes to NFS.

On the rare occasions when both the backend and some "frontend" instances restart, the "frontend" instances cannot start quickly enough to serve clients: their startup depends on a successful gRPC GetCapabilities request to the proxy backend, which is still restoring its in-memory index from disk state. In our case that takes tens of minutes, and the restarted frontends fail their startup probe for that whole time.

This causes real downtime for our clients and creates an odd asymmetry: "frontend" instances that weren't restarted keep working (failing to proxy requests, but successfully communicating with clients and serving from local disk), while new or restarted instances are dead for a significant amount of time.

This feels like a real bug. Proposed solution: make the proxy GetCapabilities check lazy, so that it happens once the proxy backend is back online instead of blocking frontend startup.
