Hey,
I am fixing the remote Bazel cache for our monorepos, and I inherited monitoring dashboards that compute the hit/miss ratio incorrectly:
sum(bazel_remote_disk_cache_hits + bazel_remote_http_cache_hits) / sum(bazel_remote_disk_cache_hits + bazel_remote_http_cache_hits + bazel_remote_disk_cache_misses + bazel_remote_http_cache_misses) which doesn't show how the ratio changes over time and resets whenever the server is restarted.
I am now on the latest release and looking for a correct way to track this metric. We are not using any 2nd layer solutions. Perhaps something similar to sum(rate(grpc_server_handled_total{service="cache", grpc_code="OK"}[1h])) / sum(rate(grpc_server_handled_total{service="cache"}[1h])) would work? I don't have any experience writing Prometheus queries for hit/miss percentage over time, so I can't tell whether this is correct, and it's not easy to confirm that the resulting numbers are right.
Our goal is to have at least a 75% hit rate and to get a warning if it drops below that.
#472 (comment) contains what I want to see, but from what I read, I assume custom code was added to the Docker image to produce such metrics.
Any help would be appreciated!
EDIT: What I am looking for is a stable metric that shows a near-real-time value I can alert on. I assume the rate window needs to be chosen accordingly. I'm also not sure which metric is best to use.
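To make the question concrete, here is a rough sketch of the kind of expression I imagine alerting on. It applies the same rate() idea to the four counters my old dashboard uses. I'm assuming those metrics are still exported as monotonically increasing counters in the latest release, and the 1h window is only a guess; I haven't verified either.

```promql
# Sketch only: windowed hit ratio, returning a value when it is below 0.75.
# Assumes all four metrics exist and are counters; if any one of them is
# missing, the whole expression returns no data instead of a ratio.
(
    sum(rate(bazel_remote_disk_cache_hits[1h]))
  + sum(rate(bazel_remote_http_cache_hits[1h]))
)
/
(
    sum(rate(bazel_remote_disk_cache_hits[1h]))
  + sum(rate(bazel_remote_http_cache_hits[1h]))
  + sum(rate(bazel_remote_disk_cache_misses[1h]))
  + sum(rate(bazel_remote_http_cache_misses[1h]))
) < 0.75
```

As far as I understand, rate() compensates for counter resets, so a server restart shouldn't zero the ratio the way the raw counters do, and a shorter window would react faster but be noisier.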