What is the correct way to track hit rate? #478

@putnap

Hey,

I am fixing the remote Bazel cache for our monorepos, and I inherited monitoring dashboards that check the hit/miss ratio incorrectly:
sum(bazel_remote_disk_cache_hits + bazel_remote_http_cache_hits) / sum(bazel_remote_disk_cache_hits + bazel_remote_http_cache_hits + bazel_remote_disk_cache_misses + bazel_remote_http_cache_misses)
This query doesn't show how the ratio changes over time, and it resets whenever the server is restarted.

I am now on the latest release and am looking for a correct way to track this metric. We are not using any second-layer solutions. Perhaps something similar to sum(rate(grpc_server_handled_total{service="cache", grpc_code="OK"}[1h])) / sum(rate(grpc_server_handled_total{service="cache"}[1h])) would work? I have no experience setting up Prometheus queries to show the hit/miss percentage over time, so I don't know whether this is correct in any way, and it is not easy to confirm that the metrics are correct.
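
To make the difference between a lifetime ratio and a windowed ratio concrete, here is a minimal Python sketch of roughly what a rate()-style ratio computes from two counters. It is a simplification, not Prometheus or bazel-remote code: it ignores the extrapolation that rate() performs, and the function names and sample values are hypothetical.

```python
from typing import Optional, Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Sum the increases between consecutive samples of a counter.

    A counter only goes up, so a lower value than the previous sample is
    treated as a restart: the counter began again from zero, and the new
    value itself is the increase since the reset.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def windowed_hit_ratio(
    hits: Sequence[float], misses: Sequence[float]
) -> Optional[float]:
    """Hit ratio over the window covered by the samples, or None if idle."""
    hit_inc = counter_increase(hits)
    miss_inc = counter_increase(misses)
    total = hit_inc + miss_inc
    return hit_inc / total if total > 0 else None


# Hypothetical scrapes of a hit counter and a miss counter over one window.
# The drop between the third and fourth samples stands in for a restart.
hits = [1000, 1030, 1060, 20, 50]
misses = [200, 210, 220, 5, 15]

ratio = windowed_hit_ratio(hits, misses)
print(f"windowed hit ratio: {ratio:.3f}")
```

Because only the increases inside the window count, the result reflects recent traffic and is not thrown off by the restart, unlike a ratio of the raw counter values.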

Our goal is a hit rate of at least 75%, with a warning if it drops below that.

#472 (comment) shows what I want to see, but from what I read, I assume custom code was added to the Docker image to produce those metrics.

Any help would be appreciated!

EDIT: What I am looking for is a stable metric that shows a real-time value I can alert on. I assume the rate period should be chosen accordingly. I am also not sure which metric is best to use.
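
As an illustration of what "stable enough to alert on" could mean, the following Python sketch only raises an alert when the hit ratio has stayed below the 75% target for several consecutive evaluations, similar in spirit to the `for` clause of a Prometheus alerting rule. The function name, the `sustain` parameter, and the sample ratios are hypothetical, and this is one possible approach under those assumptions rather than a recommended configuration.

```python
from typing import Iterable, Optional


def should_alert(
    ratios: Iterable[Optional[float]],
    threshold: float = 0.75,
    sustain: int = 3,
) -> bool:
    """Return True if the most recent `sustain` evaluations were all below
    `threshold`.

    A value of None (no traffic in that window) is not counted as a
    violation and resets the streak.
    """
    below = 0
    for ratio in ratios:
        if ratio is not None and ratio < threshold:
            below += 1
        else:
            below = 0
    return below >= sustain


# Hypothetical hit ratios from successive evaluations, oldest first.
brief_dip = [0.81, 0.70, 0.82, 0.79]
sustained_drop = [0.80, 0.70, 0.72, 0.74]

print(should_alert(brief_dip))
print(should_alert(sustained_drop))
```

A longer rate window smooths the ratio itself, while requiring a sustained violation filters out short dips; the two can be tuned together.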
