Add support for the compressed-blobs REAPI proposal#386
Conversation
|
@glukasiknuro, @ulrfa, @bdittmer, @buchgr: FYI- this is a significant change. Once this lands and stabilises, I'll make a 2.0.0 release. |
| hash := ks[len(ks)-sha256.Size*2:] | ||
| var kind cache.EntryKind = cache.AC | ||
| if strings.HasPrefix(ks, "cas") { | ||
| kind = cache.CAS |
There was a problem hiding this comment.
This needs to be updated to handle uncompressed .v1 files.
I should also add back some eviction tests with uncompressed storage mode, since the size is predictable again.
There was a problem hiding this comment.
The proxy interface also needs fixing.
ulrfa
left a comment
There was a problem hiding this comment.
Interesting @mostynb! I think the network bandwidth saving can be beneficial for my use cases! 🎉
Do you know if the bazel client, and others such as buildbarn, have plans to implement or make use of this feature?
Do you believe clients will typically use zstd for all their CAS uploads/downloads (including small files)? Do you expect transcoding to be common?
However, I wonder about reading from offset in compressed files. What use cases do you have in mind for that? It seems to me that it comes with drawbacks:
A) Need for header in each file.
B) Re-compression when receiving already compressed files via HTTP PUT.
C) Complexity.
D) Performance overhead (due to A and B).
For me it would be attractive to have the files stored without the extra header, and instead with a .zst file name extension (RFC 8478). Primarily for performance reasons, but also to allow digging into the cache directories with standard zstd command line tools. The logical size could also be embedded in the file name if needed.
Have you run performance benchmarks?
| if err != nil { | ||
| return err | ||
| if !legacy { | ||
| size, err = casblob.GetLogicalSize(f.name) |
There was a problem hiding this comment.
I wonder how long time it takes to restart bazel-remote, when data have to be read from each file?
An alternative could be to embed the logical size as part of the file name.
There was a problem hiding this comment.
In my largest bazel-remote instance that's running something close to this branch, restarting is fast enough. It's still growing though, so I can re-check once it has filled the space.
There was a problem hiding this comment.
The CAS in one of our cache instances is 1500 GB and contains 8 000 000 files.
I’m emulating restarting the cache by reading 8 bytes from them. Running “xargs head -c 8 > /dev/null” over them took 20 minutes. That is too long for us.
Re-running again with warm file system caches took 45 seconds.
(A bazel-remote instance was using the CAS at the same time)
Some of our caches are even larger. All of them are on SSD, but other people might have rotating disks where these 8 byte random-accesses takes even more time.
There was a problem hiding this comment.
I think a 45 second restart for a large cache is reasonable, though not ideal.
A 20 minute cold start is pretty bad though. There is one other potential startup optimization that could help- loading file details concurrently (I know that @bdittmer has a local patch for this, which helps even at small scale). I wonder if you could try using xargs -P 72 in your benchmark (and maybe a couple of lower values too)?
(I don't think we should bother considering HDD-speeds, it's the wrong tool for the job and I would rather just recommend consumer SSD-like speed storage as a minimum.)
There was a problem hiding this comment.
45 second restart time is longer than I would like in production. Because the longer time the cache is down, while restarting to upgrade version or change configuration, the higher risk for failing builds-without-the-bytes assumption about cache always available.
A first try with -P 72 took 4 minutes, but when re-trying with -P 72 more times, it took 1m45s, despite clearing filesystem cache with "sync; echo 3 > /proc/sys/vm/drop_caches" in between. Due to the difference, I don't fully trust the numbers. I wonder if there are some other caching layer involved, or if the first try coincided with peak load from the bazel-remote instance also running towards the disk. The fastest time I got was with -P 64, which was 1m 17s. If you are interested, I can list values for all -P values I tried.
But it come to my mind that there is one more large issue: Reading headers from files during startup, would touch file system access time, and that affects the ordering in LRU.
I believe all these issues can be avoided by simply storing the logical size embedded in the file name, like: cas/ef95d34495b36c0a6108c4e48679ebebfb1cee01364e3dceead1f95ae4276d2b_450.zst, if the logical size is 450 bytes and compressed with zstandard.
Or even better, I would prefer having both the logical size and the on-disk-size embedded in the file name. Since if there is a guaranteed one-to-one association between disk.lruItem entries and file names, then we can consider releasing the mutex during some syscalls in disk.go, as discussed in #323. Any changes to the mutexes could be considered in later separate work, but embedding both sizes, as preparation, would be easier in this PR.
Example zstandard compressed with logical size 450 bytes and on-disk-size 290 bytes:
cas/ef95d34495b36c0a6108c4e48679ebebfb1cee01364e3dceead1f95ae4276d2b_450_290.zst
Example non-compressed with both logical and on-disk size as 450 bytes:
cas/ef95d34495b36c0a6108c4e48679ebebfb1cee01364e3dceead1f95ae4276d2b_450
There was a problem hiding this comment.
Those numbers look bad enough to give up on reading all the headers during loading. I will start refactoring.
I don't think the combination of logical size and on-disk size would be unique enough for the early mutex release idea- that would probably require storing a pseudo-random string or number in each lruItem.
Aside: using atimes for the initial LRU order is only approximate, because Contains updates the LRU but for efficiency reasons does not touch the filesystem (builds-without-the-bytes makes this worse). Reading data from file headers didn't make things much worse because the files were sorted by atime in one pass, then imported in that order in a second pass (but it added bias to the filesystem atime order for compressed CAS items). If the cache was running long enough then the atime order would become more similar to the LRU index order.
In the future, I might consider writing out the index order to disk prior to clean shutdown (and maybe also periodically) and then only use atimes as a fallback.
There was a problem hiding this comment.
I don't think the combination of logical size and on-disk size would be unique enough for the early mutex release
I believe it would be enough, since then it would not matter how concurrent PUT requests of same key are interleaved. Because all their LRU entry commits would be compatible with any of their files, and it would not matter if we end up with LRU commit from one request and file from another. PUT requests interleaved with evictions of same key, is more complex, but I have hope that is also possible without holding mutexes during file system syscalls.
Reading data from file headers didn't make things much worse because the files were sorted by atime in one pass, then imported in that order in a second pass
Our CAS file systems are mounted with the relatime option, so each access time is not updated more than once a day. That is normally not an issue, since files remains in our caches much longer than a day. But I suspect that same order in first and second pass would not have been sufficient with relatime.
Thanks for the refactoring!
Bazel plans to implement this: bazelbuild/bazel#12670 I don't think buildbarn is interested in implementing this for now (but I will offer to contribute patches later).
It's entirely up to the client, so hard to predict (this is why I added a storage mode config option). However zstandard is fast enough that it might be OK to do a lot of transcoding, and there is still some benefit to be had from the increased effective cache size. I suspect that bazel will add a simple on/off setting for using compressed-blobs (ie use it for all blobs, or for none), and maybe stop there.
The use case is for resuming interrupted downloads of large blobs, without resorting to downloading in uncompressed form or decompressing the entire blob + recompressing the tail. With the chunked format, at most one chunk needs to be recompressed. The server needs to decompress the blob after receiving it in order to verify the hash anyway, so it's just one extra compression, which should be OK if blobs are likely to be read multiple times.
There is a seekable zstandard format, but as far as I know nothing supports it yet. I would consider migrating to that later if it takes off. In the meantime if needed we could provide a small utility which strips the header from a blob (the concatenated chunks can be processed by standard tools).
Transferring large compressible blobs is much faster (of course), and I see cache sizes about 30% of the uncompressed size in my test cases (this will vary for different workloads). |
Are people often experiencing interrupted downloads of huge files from CAS? (Personally, I’m not) I wonder how the performance is affected with huge amount of concurrent HTTP requests? I would be happy to run benchmarks with 72 core machines and 20 Gbps network. But how to best do that now without zstd support in bazel client? Would it make sense to benchmark the following scenarios?
|
1d664a1 to
ad8de4d
Compare
I don't have an answer to this, because IIRC it's possible to download uncompressed blobs from an offset and therefore I'm unlikely to get complaints about it.
I think the best we can do in the meantime is try to put together an artificial stress-test setup using something like curl (recent versions support zstd downloads) or a custom client (which is fairly easy to write). Then we can ramp up the load with xargs or make or similar.
That is measurable today, with bazel configured to use remote cache via gRPC or HTTP. We should expect higher CPU and memory usage on the server, but hopefully a larger effective cache size.
There should be no extra load if Update: I need to fix the Put code path. |
|
BTW, I am also thinking through some
|
|
Interesting, maybe we could implement the header as a skippable frame (or multiple skippable frames), in which case the entire compressed file would be decodable by any compliant zstandard decoder: https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#skippable-frames The cost compared to the current header would only be 8 bytes per file for most files. |
b8b1f23 to
ee44820
Compare
|
Pushed some new commits:
The first two changes break compatibility with cache directories created with earlier versions of this branch/PR. (The proxy backends are still broken, that's next on my TODO list.) |
Interesting idea! The client might have additional information to make decision if certain files are worth compressing or not, then it could make sense to also store them in CAS as they were uploaded. Or bazel-remote could decide to not compress files smaller than 4 KB? Since those would still consume a whole file system block anyway. Below follows an example of the file size distribution in one of our caches. Note that 60% of the files in that CAS have size of 128 bytes or less. Those should not be compressed. Also note that 98% of the files would fit in a single 1M chunk. For them the seekable zstd table is unnecessary. |
Note that there are two other costs to consider: network transfer time (even if we didn't save disk space, we might save transfer time) and re-compression time (eg if new clients request all blobs in compressed form). This is something that I think we could experiment with in the "hybrid" storage mode.
Interesting- can you share the language(s) in use here?
Right, it's an optimization for large blobs. |
Agree about experiments. For example I’m curious about how much the network transfer is affected by compressing small files that would have fit in a single 1500 byte ethernet MTU even as non-compressed. Regarding re-compression: If bazel-remote decides to store some files uncompressed, then I assume bazel-remote would have the choice of transferring them uncompressed, even if client also accepts zstd encoding?
I think it is mostly building C code and running tests with a proprietary test framework. Many of the tiny files seems to be REv2 messages, such as Action messages that mainly contains only two Digests. That particular cache instance is not used for remote execution, but those messages are always uploaded by bazel client to CAS also for pure caching setups. However, they are only downloaded if using remote execution. (Note that I mean the Action message in CAS, not the ActionResult message in AC) One-liners for calculating size distribution: https://superuser.com/questions/565443/generate-distribution-of-file-sizes-from-the-command-prompt |
That is possible for HTTP (with some refactoring) because the server has a way to specify the encoding, but not for gRPC bytestream where the server only returns data (not metadata). It might be something to reconsider for REAPIv3.
We could try using heuristics in a future hybrid storage mode. For example, any CAS blob under a certain size is stored uncompressed, and then only compressed for storage if it is downloaded a certain number of times (between bazel-remote restarts).
I can't think of a reason why the client would upload Action messages for cache-only mode. My guess is that it either simplifies the code on the client side, or it's a bug. |
I'm not sure, but one reason could be the following, which is valuable for me: Sometimes I analyze content in ac & cas, by starting from the ActionResult and following references from Action to Command and further, to get the associations between output files in CAS and their originating commands. In order answer questions like:
I have only done it in small caches, but I have ideas about a tool for creating index mapping original action output file names to Action digests. Then it hope being able to see patterns also in large caches. I though about generating such index from within bazel-remote, but I would like to avoid that overhead and instead only generating it on demand when needed. |
75094b0 to
c199b08
Compare
|
The proxy backends should work now, and I'm not aware of any other blockers for landing this. There is one limitation that might be annoying- it's not possible to use storage_mode zstd and proxy to another bazel-remote instance via http, but I think that can be implemented later. |
|
Thanks for doing this. |
|
@tschuett: I have some ideas for "hybrid" storage modes that we can experiment with later, which might cover that scenario if it occurs often enough. Hopefully reads from offsets are rare though. |
|
Thanks. It will be interesting to see the actual Bazel implementation. |
|
@ulrfa: I decided to try to squeeze in one last change while changing the on-disk format. The latest commit uses psuedo-random strings in the blob filenames, so there's no canonical file for each hash that needs to be replaced while holding the LRU index lock. To avoid incomplete files being used when restarting bazel-remote, I use the setgid permission bit- it's set when creating the file and un-set once the file has been written successfully. Then on startup we discard any files with the setgid bit set. |
@mostynb, thanks for considering the on-disk format for the locks! I wonder what happens in the following scenario, with two PUT and one GET request, all with the same key: Request1-PUT: Writes file and commits to LRU. |
|
So request 2 gets the first version of the file and has a valid/usable file handle, then that file is unlinked due to request 3, but request 2 progresses as usual. |
@mostynb, to me it seems like the mutex is released before opening the file. Right? bazel-remote/cache/disk/disk.go Lines 689 to 696 in 33dfce6 I think we decided to release the lock before os.Open on purpose, to avoid holding the mutex while doing frequent filesystem syscalls. And we considered it OK that os.Open could fail if an old entry was evicted. But with the latest change also recently written files are deleted when replacing lru item with same key. |
Right- good catch. There's also another bug which masks this one: I forgot to make One solution to the first bug might be to try to open the file without holding the lock, and if the file does not exist anymore, try again while holding the lock. For the second bug, we can either delete the old file while holding the lock, or spawn a goroutine to perform the deletion so we can continue and release the lock (possibly with a delay to increase the probability of hitting the fast path inside Get). I'll think this through some more, maybe there's a better solution. |
|
Here's another potential solution, reference counting Gets that haven't opened the file yet:
This will overflow if there are 4,294 million concurrent Gets (that haven't opened the file yet) for the same value, but I don't think we'd scale that high anyway. However, I think it would be better to start with the fallback slow path fix I mentioned the other day. ie if we fail to open the file, perform another LRU lookup and open the file while holding the lock (and remove files when replacing items in the index- also after refactoring, so it can be done without holding the lock). Maybe that will be fast enough. |
The default is to use zstd, but some installs might choose to use uncompressed data (for example if clients mostly upload blobs which are already compressed).
Migrating old data files to the new format is too slow, we will need to add back support for raw uncompressed files.
The on-disk format seems to be settled now, and migration should be as fast as is reasonably possible. I no longer consider it a risky move to migrate old caches.
…alue This could put significant load on the backend server. Instead, just return "unknown size" and rely on the SHA256 key by itself to avoid collisions.
By using a zstd skippable frame for our header, the file becomes a valid .zst file on disk. This makes it possible to inspect cache files with standard tools. The cost is only 8 bytes per (compressed) file. Compressed CAS blobs are now limited to ~511TB (with 1M chunks), which should not be an issue in practical use.
By doing this we can build the LRU index on startup using only the directory listings (ie without reading the header of compressed blobs). This should keep startup times low.
While preexisting legacy/.v1 CAS blobs were being imported and used correctly, newly uploaded blobs weren't being named correctly with --storage_mode uncompressed.
Unlike the other settings, these were previously only settable from a config file.
Since we have two different on-disk CAS blob formats, and we don't want to perform transcoding in the proxy backends, we need to use a different path for v2 CAS blobs. This means that for now, it's not possible to use storage_mode: zstd and use the httpproxy backend to proxy blobs to another bazel-remote instance. Also, when switching storage modes, stored blobs for the other storage mode on the proxy backends will not be used.
Rely on the sha256 hash alone, like we do in httpproxy.Contains. This avoids the need to update this code since we added the zstandard skippable framing, which I don't have time to debug right now.
…he LRU lock By adding a psuedo-random string to the filename for blobs and storing that string in the index, we no longer need to rename files while holding the LRU index lock. To avoid incomplete files from being reused after restarting bazel-remote, we use a bit of a hack: create the blob file with the setgid bit set, and un-set it once the blob is complete. On startup, remove any files with the setgid bit set.
…mments more explicit
If there is an error while reading the input file, we should pass that error on to readers so they know not to use incomplete files.
We can peek at the first part of the data to figure out the logical size, then let the disk cache layer decide if it wants to use the blob or not.
51a10d3 to
48e8649
Compare
This PR adds support for the compressed-blobs REAPI proposal.
By default, uploaded CAS blobs are stored in a file with a header and zstandard-compressed chunks (corresponding to 1M of raw data, initially). When clients request one of these items in zstandard compressed form, bazel-remote can stream data without recompression (or if the client requests data from an offset, at most one chunk will be recompressed).
Preexisting CAS blobs are not migrated to the compressed format, because this can take a lot of time for large caches. Instead, these blobs are moved into the new cache directory structure with a .v1 filename extension. If the storage mode is set to "uncompressed" then CAS blobs are always stored in this form. (I plan to investigate on-the-fly recompression for preexisting blobs later.)
The HTTP interface also supports zstandard compressed blobs via
Accept-Encoding: zstdfor GET andContent-Encoding: zstdand a customX-Digest-SizeBytesheader for PUT.Implements #12.