Skip to content

Add support for the compressed-blobs REAPI proposal#386

Merged
mostynb merged 35 commits into
buchgr:masterfrom
mostynb:compressed_blobs_support
Mar 2, 2021
Merged

Add support for the compressed-blobs REAPI proposal#386
mostynb merged 35 commits into
buchgr:masterfrom
mostynb:compressed_blobs_support

Conversation

@mostynb
Copy link
Copy Markdown
Collaborator

@mostynb mostynb commented Jan 3, 2021

This PR adds support for the compressed-blobs REAPI proposal.

By default, uploaded CAS blobs are stored in a file with a header and zstandard-compressed chunks (corresponding to 1M of raw data, initially). When clients request one of these items in zstandard compressed form, bazel-remote can stream data without recompression (or if the client requests data from an offset, at most one chunk will be recompressed).

Preexisting CAS blobs are not migrated to the compressed format, because this can take a lot of time for large caches. Instead, these blobs are moved into the new cache directory structure with a .v1 filename extension. If the storage mode is set to "uncompressed" then CAS blobs are always stored in this form. (I plan to investigate on-the-fly recompression for preexisting blobs later.)

The HTTP interface also supports zstandard compressed blobs via Accept-Encoding: zstd for GET and Content-Encoding: zstd and a custom X-Digest-SizeBytes header for PUT.

Implements #12.

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Jan 3, 2021

@glukasiknuro, @ulrfa, @bdittmer, @buchgr: FYI- this is a significant change.

Once this lands and stabilises, I'll make a 2.0.0 release.

Comment thread cache/disk/disk.go
hash := ks[len(ks)-sha256.Size*2:]
var kind cache.EntryKind = cache.AC
if strings.HasPrefix(ks, "cas") {
kind = cache.CAS
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated to handle uncompressed .v1 files.

I should also add back some eviction tests with uncompressed storage mode, since the size is predictable again.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proxy interface also needs fixing.

Copy link
Copy Markdown
Contributor

@ulrfa ulrfa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting @mostynb! I think the network bandwidth saving can be beneficial for my use cases! 🎉

Do you know if the bazel client, and others such as buildbarn, have plans to implement or make use of this feature?

Do you believe clients will typically use zstd for all their CAS uploads/downloads (including small files)? Do you expect transcoding to be common?

However, I wonder about reading from offset in compressed files. What use cases do you have in mind for that? It seems to me that it comes with drawbacks:

A) Need for header in each file.
B) Re-compression when receiving already compressed files via HTTP PUT.
C) Complexity.
D) Performance overhead (due to A and B).

For me it would be attractive to have the files stored without the extra header, and instead with a .zst file name extension (RFC 8478). Primarily for performance reasons, but also to allow digging into the cache directories with standard zstd command line tools. The logical size could also be embedded in the file name if needed.

Have you run performance benchmarks?

Comment thread cache/disk/disk.go Outdated
Comment thread cache/httpproxy/httpproxy.go Outdated
Comment thread cache/disk/disk.go Outdated
if err != nil {
return err
if !legacy {
size, err = casblob.GetLogicalSize(f.name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how long time it takes to restart bazel-remote, when data have to be read from each file?

An alternative could be to embed the logical size as part of the file name.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my largest bazel-remote instance that's running something close to this branch, restarting is fast enough. It's still growing though, so I can re-check once it has filled the space.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CAS in one of our cache instances is 1500 GB and contains 8 000 000 files.

I’m emulating restarting the cache by reading 8 bytes from them. Running “xargs head -c 8 > /dev/null” over them took 20 minutes. That is too long for us.

Re-running again with warm file system caches took 45 seconds.

(A bazel-remote instance was using the CAS at the same time)

Some of our caches are even larger. All of them are on SSD, but other people might have rotating disks where these 8 byte random-accesses takes even more time.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a 45 second restart for a large cache is reasonable, though not ideal.

A 20 minute cold start is pretty bad though. There is one other potential startup optimization that could help- loading file details concurrently (I know that @bdittmer has a local patch for this, which helps even at small scale). I wonder if you could try using xargs -P 72 in your benchmark (and maybe a couple of lower values too)?

(I don't think we should bother considering HDD-speeds, it's the wrong tool for the job and I would rather just recommend consumer SSD-like speed storage as a minimum.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

45 second restart time is longer than I would like in production. Because the longer time the cache is down, while restarting to upgrade version or change configuration, the higher risk for failing builds-without-the-bytes assumption about cache always available.

A first try with -P 72 took 4 minutes, but when re-trying with -P 72 more times, it took 1m45s, despite clearing filesystem cache with "sync; echo 3 > /proc/sys/vm/drop_caches" in between. Due to the difference, I don't fully trust the numbers. I wonder if there are some other caching layer involved, or if the first try coincided with peak load from the bazel-remote instance also running towards the disk. The fastest time I got was with -P 64, which was 1m 17s. If you are interested, I can list values for all -P values I tried.

But it come to my mind that there is one more large issue: Reading headers from files during startup, would touch file system access time, and that affects the ordering in LRU.

I believe all these issues can be avoided by simply storing the logical size embedded in the file name, like: cas/ef95d34495b36c0a6108c4e48679ebebfb1cee01364e3dceead1f95ae4276d2b_450.zst, if the logical size is 450 bytes and compressed with zstandard.

Or even better, I would prefer having both the logical size and the on-disk-size embedded in the file name. Since if there is a guaranteed one-to-one association between disk.lruItem entries and file names, then we can consider releasing the mutex during some syscalls in disk.go, as discussed in #323. Any changes to the mutexes could be considered in later separate work, but embedding both sizes, as preparation, would be easier in this PR.

Example zstandard compressed with logical size 450 bytes and on-disk-size 290 bytes:

cas/ef95d34495b36c0a6108c4e48679ebebfb1cee01364e3dceead1f95ae4276d2b_450_290.zst

Example non-compressed with both logical and on-disk size as 450 bytes:

cas/ef95d34495b36c0a6108c4e48679ebebfb1cee01364e3dceead1f95ae4276d2b_450

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those numbers look bad enough to give up on reading all the headers during loading. I will start refactoring.

I don't think the combination of logical size and on-disk size would be unique enough for the early mutex release idea- that would probably require storing a pseudo-random string or number in each lruItem.

Aside: using atimes for the initial LRU order is only approximate, because Contains updates the LRU but for efficiency reasons does not touch the filesystem (builds-without-the-bytes makes this worse). Reading data from file headers didn't make things much worse because the files were sorted by atime in one pass, then imported in that order in a second pass (but it added bias to the filesystem atime order for compressed CAS items). If the cache was running long enough then the atime order would become more similar to the LRU index order.

In the future, I might consider writing out the index order to disk prior to clean shutdown (and maybe also periodically) and then only use atimes as a fallback.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the combination of logical size and on-disk size would be unique enough for the early mutex release

I believe it would be enough, since then it would not matter how concurrent PUT requests of same key are interleaved. Because all their LRU entry commits would be compatible with any of their files, and it would not matter if we end up with LRU commit from one request and file from another. PUT requests interleaved with evictions of same key, is more complex, but I have hope that is also possible without holding mutexes during file system syscalls.

Reading data from file headers didn't make things much worse because the files were sorted by atime in one pass, then imported in that order in a second pass

Our CAS file systems are mounted with the relatime option, so each access time is not updated more than once a day. That is normally not an issue, since files remains in our caches much longer than a day. But I suspect that same order in first and second pass would not have been sufficient with relatime.

Thanks for the refactoring!

Comment thread server/http.go
@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Jan 8, 2021

Interesting @mostynb! I think the network bandwidth saving can be beneficial for my use cases! 🎉

Do you know if the bazel client, and others such as buildbarn, have plans to implement or make use of this feature?

Bazel plans to implement this: bazelbuild/bazel#12670
It looks like buildfarm might too: buildfarm/buildfarm#504

I don't think buildbarn is interested in implementing this for now (but I will offer to contribute patches later).

Do you believe clients will typically use zstd for all their CAS uploads/downloads (including small files)? Do you expect transcoding to be common?

It's entirely up to the client, so hard to predict (this is why I added a storage mode config option). However zstandard is fast enough that it might be OK to do a lot of transcoding, and there is still some benefit to be had from the increased effective cache size.

I suspect that bazel will add a simple on/off setting for using compressed-blobs (ie use it for all blobs, or for none), and maybe stop there.

However, I wonder about reading from offset in compressed files. What use cases do you have in mind for that? It seems to me that it comes with drawbacks:

A) Need for header in each file.
B) Re-compression when receiving already compressed files via HTTP PUT.
C) Complexity.
D) Performance overhead (due to A and B).

The use case is for resuming interrupted downloads of large blobs, without resorting to downloading in uncompressed form or decompressing the entire blob + recompressing the tail. With the chunked format, at most one chunk needs to be recompressed.

The server needs to decompress the blob after receiving it in order to verify the hash anyway, so it's just one extra compression, which should be OK if blobs are likely to be read multiple times.

For me it would be attractive to have the files stored without the extra header, and instead with a .zst file name extension (RFC 8478). Primarily for performance reasons, but also to allow digging into the cache directories with standard zstd command line tools. The logical size could also be embedded in the file name if needed.

There is a seekable zstandard format, but as far as I know nothing supports it yet. I would consider migrating to that later if it takes off.

In the meantime if needed we could provide a small utility which strips the header from a blob (the concatenated chunks can be processed by standard tools).

Have you run performance benchmarks?

Transferring large compressible blobs is much faster (of course), and I see cache sizes about 30% of the uncompressed size in my test cases (this will vary for different workloads).

@ulrfa
Copy link
Copy Markdown
Contributor

ulrfa commented Jan 9, 2021

However, I wonder about reading from offset in compressed files. What use cases do you have in mind for that? It seems to me that it comes with drawbacks:
A) Need for header in each file.
B) Re-compression when receiving already compressed files via HTTP PUT.
C) Complexity.
D) Performance overhead (due to A and B).

The use case is for resuming interrupted downloads of large blobs, without resorting to downloading in uncompressed form or decompressing the entire blob + recompressing the tail. With the chunked format, at most one chunk needs to be recompressed.

Are people often experiencing interrupted downloads of huge files from CAS? (Personally, I’m not)

I wonder how the performance is affected with huge amount of concurrent HTTP requests? I would be happy to run benchmarks with 72 core machines and 20 Gbps network. But how to best do that now without zstd support in bazel client?

Would it make sense to benchmark the following scenarios?

  • A. Non-compressed HTTP GET & PUT requests from bazel clients, to bazel-remote --storage_mode zstd

    • In order to detect overhead by transcoding
    • In order to detect overhead by io.Copy not being able to optimize with sendfile(2) since source is not an *os.File regular file.
  • B. Non-compressed HTTP GET & PUT requests from bazel clients, to bazel-remote --storage_mode uncompressed

    • In order to detect overhead by all the small file syscalls to read/write the header

@mostynb mostynb force-pushed the compressed_blobs_support branch from 1d664a1 to ad8de4d Compare January 10, 2021 14:14
@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Jan 11, 2021

Are people often experiencing interrupted downloads of huge files from CAS? (Personally, I’m not)

I don't have an answer to this, because IIRC it's possible to download uncompressed blobs from an offset and therefore I'm unlikely to get complaints about it.

I wonder how the performance is affected with huge amount of concurrent HTTP requests? I would be happy to run benchmarks with 72 core machines and 20 Gbps network. But how to best do that now without zstd support in bazel client?

I think the best we can do in the meantime is try to put together an artificial stress-test setup using something like curl (recent versions support zstd downloads) or a custom client (which is fairly easy to write). Then we can ramp up the load with xargs or make or similar.

Would it make sense to benchmark the following scenarios?

  • A. Non-compressed HTTP GET & PUT requests from bazel clients, to bazel-remote --storage_mode zstd

    • In order to detect overhead by transcoding
    • In order to detect overhead by io.Copy not being able to optimize with sendfile(2) since source is not an *os.File regular file.

That is measurable today, with bazel configured to use remote cache via gRPC or HTTP. We should expect higher CPU and memory usage on the server, but hopefully a larger effective cache size.

  • B. Non-compressed HTTP GET & PUT requests from bazel clients, to bazel-remote --storage_mode uncompressed

    • In order to detect overhead by all the small file syscalls to read/write the header

There should be no extra load if --storage_mode uncompressed stores raw files with a ".v1" extension to distinguish them from the new header + compressed blob format. I added this for the "migrating existing files from the old directory structure" step, but I can't remember if I have added if for the Put code path yet. I'll follow up tomorrow.

Update: I need to fix the Put code path.

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Jan 12, 2021

BTW, I am also thinking through some --storage_mode hybrid ideas, but it's not something I want to attempt in the first PR. For example:

  • Store blobs in the form they were uploaded (uncompressed or in our compressed format).
  • Compress and replace uncompressed blobs if/when a client requests them in compressed form.

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Jan 12, 2021

Interesting, maybe we could implement the header as a skippable frame (or multiple skippable frames), in which case the entire compressed file would be decodable by any compliant zstandard decoder: https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#skippable-frames

The cost compared to the current header would only be 8 bytes per file for most files.

@mostynb mostynb force-pushed the compressed_blobs_support branch from b8b1f23 to ee44820 Compare January 15, 2021 08:25
@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Jan 15, 2021

Pushed some new commits:

  1. Make the compressed CAS blob file format regular .zst (the header is now a zstd skippable frame).
  2. Include the logical size in the filename of compressed blobs, so we can import them from the directory listing alone (ie without reading the header of each blob).
  3. Fix the uploaded CAS blob filenames for --storage_mode uncompressed.

The first two changes break compatibility with cache directories created with earlier versions of this branch/PR.

(The proxy backends are still broken, that's next on my TODO list.)

@ulrfa
Copy link
Copy Markdown
Contributor

ulrfa commented Jan 17, 2021

BTW, I am also thinking through some --storage_mode hybrid ideas, but it's not something I want to attempt in the first PR. For example:

  • Store blobs in the form they were uploaded (uncompressed or in our compressed format).

Interesting idea!

The client might have additional information to make decision if certain files are worth compressing or not, then it could make sense to also store them in CAS as they were uploaded.

Or bazel-remote could decide to not compress files smaller than 4 KB? Since those would still consume a whole file system block anyway.

Below follows an example of the file size distribution in one of our caches.

Note that 60% of the files in that CAS have size of 128 bytes or less. Those should not be compressed.

Also note that 98% of the files would fit in a single 1M chunk. For them the seekable zstd table is unnecessary.

Size  Number of files in CAS
   8:       6
  16:   47690
  32:     100
  64:     537
 128: 6471070
 256:    5044
 512:   64129
  1k:  183429
  2k:  320129
  4k:  998385
  8k: 1611520
 16k:  264389
 32k:  269138
 64k:  177826
128k:  109303
256k:   96537
512k:   61308
  1M:   22527
  2M:   70088
  4M:   62483
  8M:   55450
 16M:    5043
 32M:    1657
 64M:      34
256M:       2

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Jan 18, 2021

BTW, I am also thinking through some --storage_mode hybrid ideas, but it's not something I want to attempt in the first PR. For example:

  • Store blobs in the form they were uploaded (uncompressed or in our compressed format).

Interesting idea!

The client might have additional information to make decision if certain files are worth compressing or not, then it could make sense to also store them in CAS as they were uploaded.

Or bazel-remote could decide to not compress files smaller than 4 KB? Since those would still consume a whole file system block anyway.

Note that there are two other costs to consider: network transfer time (even if we didn't save disk space, we might save transfer time) and re-compression time (eg if new clients request all blobs in compressed form). This is something that I think we could experiment with in the "hybrid" storage mode.

Below follows an example of the file size distribution in one of our caches.

Interesting- can you share the language(s) in use here?

Note that 60% of the files in that CAS have size of 128 bytes or less. Those should not be compressed.

Also note that 98% of the files would fit in a single 1M chunk. For them the seekable zstd table is unnecessary.

Right, it's an optimization for large blobs.

@ulrfa
Copy link
Copy Markdown
Contributor

ulrfa commented Jan 18, 2021

Note that there are two other costs to consider: network transfer time (even if we didn't save disk space, we might save transfer time) and re-compression time (eg if new clients request all blobs in compressed form). This is something that I think we could experiment with in the "hybrid" storage mode.

Agree about experiments. For example I’m curious about how much the network transfer is affected by compressing small files that would have fit in a single 1500 byte ethernet MTU even as non-compressed.

Regarding re-compression: If bazel-remote decides to store some files uncompressed, then I assume bazel-remote would have the choice of transferring them uncompressed, even if client also accepts zstd encoding?

Below follows an example of the file size distribution in one of our caches.

Interesting- can you share the language(s) in use here?

I think it is mostly building C code and running tests with a proprietary test framework.

Many of the tiny files seems to be REv2 messages, such as Action messages that mainly contains only two Digests. That particular cache instance is not used for remote execution, but those messages are always uploaded by bazel client to CAS also for pure caching setups. However, they are only downloaded if using remote execution.

(Note that I mean the Action message in CAS, not the ActionResult message in AC)

One-liners for calculating size distribution: https://superuser.com/questions/565443/generate-distribution-of-file-sizes-from-the-command-prompt

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Jan 18, 2021

Note that there are two other costs to consider: network transfer time (even if we didn't save disk space, we might save transfer time) and re-compression time (eg if new clients request all blobs in compressed form). This is something that I think we could experiment with in the "hybrid" storage mode.

Agree about experiments. For example I’m curious about how much the network transfer is affected by compressing small files that would have fit in a single 1500 byte ethernet MTU even as non-compressed.

Regarding re-compression: If bazel-remote decides to store some files uncompressed, then I assume bazel-remote would have the choice of transferring them uncompressed, even if client also accepts zstd encoding?

That is possible for HTTP (with some refactoring) because the server has a way to specify the encoding, but not for gRPC bytestream where the server only returns data (not metadata). It might be something to reconsider for REAPIv3.

Below follows an example of the file size distribution in one of our caches.

Interesting- can you share the language(s) in use here?

I think it is mostly building C code and running tests with a proprietary test framework.

Many of the tiny files seems to be REv2 messages, such as Action messages that mainly contains only two Digests. That particular cache instance is not used for remote execution, but those messages are always uploaded by bazel client to CAS also for pure caching setups. However, they are only downloaded if using remote execution.

We could try using heuristics in a future hybrid storage mode. For example, any CAS blob under a certain size is stored uncompressed, and then only compressed for storage if it is downloaded a certain number of times (between bazel-remote restarts).

(Note that I mean the Action message in CAS, not the ActionResult message in AC)

I can't think of a reason why the client would upload Action messages for cache-only mode. My guess is that it either simplifies the code on the client side, or it's a bug.

@ulrfa
Copy link
Copy Markdown
Contributor

ulrfa commented Jan 19, 2021

I can't think of a reason why the client would upload Action messages for cache-only mode. My guess is that it either simplifies the code on the client side, or it's a bug.

I'm not sure, but one reason could be the following, which is valuable for me:

Sometimes I analyze content in ac & cas, by starting from the ActionResult and following references from Action to Command and further, to get the associations between output files in CAS and their originating commands. In order answer questions like:

  • How many alternative compiled object files are there in the CAS for a particular source file?
  • What changes typically cause cache misses?
  • Are there many similar commands resulting in same output file?
  • ...

I have only done it in small caches, but I have ideas about a tool for creating index mapping original action output file names to Action digests. Then it hope being able to see patterns also in large caches. I though about generating such index from within bazel-remote, but I would like to avoid that overhead and instead only generating it on demand when needed.

@mostynb mostynb force-pushed the compressed_blobs_support branch 3 times, most recently from 75094b0 to c199b08 Compare February 7, 2021 14:09
@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Feb 8, 2021

The proxy backends should work now, and I'm not aware of any other blockers for landing this. There is one limitation that might be annoying- it's not possible to use storage_mode zstd and proxy to another bazel-remote instance via http, but I think that can be implemented later.

@tschuett
Copy link
Copy Markdown

tschuett commented Feb 9, 2021

Thanks for doing this.
How about a second small lru cache. When somebody reads from an offset of a Zstd file, uncompress the file, put the uncompressed file into the second cache, and serve the data to the client. If somebody else reads from an offset from the same file, you can serve the data from the second cache.

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Feb 9, 2021

@tschuett: I have some ideas for "hybrid" storage modes that we can experiment with later, which might cover that scenario if it occurs often enough. Hopefully reads from offsets are rare though.

@tschuett
Copy link
Copy Markdown

tschuett commented Feb 9, 2021

Thanks. It will be interesting to see the actual Bazel implementation.

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Feb 14, 2021

@ulrfa: I decided to try to squeeze in one last change while changing the on-disk format. The latest commit uses psuedo-random strings in the blob filenames, so there's no canonical file for each hash that needs to be replaced while holding the LRU index lock.

To avoid incomplete files being used when restarting bazel-remote, I use the setgid permission bit- it's set when creating the file and un-set once the file has been written successfully. Then on startup we discard any files with the setgid bit set.

@ulrfa
Copy link
Copy Markdown
Contributor

ulrfa commented Feb 15, 2021

@ulrfa: I decided to try to squeeze in one last change while changing the on-disk format. The latest commit uses psuedo-random strings in the blob filenames, so there's no canonical file for each hash that needs to be replaced while holding the LRU index lock.

@mostynb, thanks for considering the on-disk format for the locks!

I wonder what happens in the following scenario, with two PUT and one GET request, all with the same key:

Request1-PUT: Writes file and commits to LRU.
Request2-GET: lru.Get returns item.random. But Request3 arrives before Request2 calls os.Open.
Request3-PUT: Writes file and commits to LRU. Will it replace the previous entry in LRU? Is that triggering an eviction of the file from Request1?
Request2-GET: Continues with os.Open. Will it fail to open the file from request1?

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Feb 15, 2021

availableOrTryProxy opens the file while holding the LRU lock and returns the *os.File, and the lock is also held whenever evicting items (somewhere inside Put or Get with a proxy download).

So request 2 gets the first version of the file and has a valid/usable file handle, then that file is unlinked due to request 3, but request 2 progresses as usual.

@ulrfa
Copy link
Copy Markdown
Contributor

ulrfa commented Feb 15, 2021

availableOrTryProxy opens the file while holding the LRU lock

@mostynb, to me it seems like the mutex is released before opening the file. Right?

c.mu.Unlock() // We expect a cache hit below.
locked = false
blobPath := path.Join(c.dir, c.FileLocation(kind, item.legacy, hash, item.size, item.random))
if !isSizeMismatch(size, item.size) {
var f *os.File
f, err = os.Open(blobPath)

I think we decided to release the lock before os.Open on purpose, to avoid holding the mutex while doing frequent filesystem syscalls. And we considered it OK that os.Open could fail if an old entry was evicted. But with the latest change also recently written files are deleted when replacing lru item with same key.

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Feb 15, 2021

@mostynb, to me it seems like the mutex is released before opening the file. Right?

c.mu.Unlock() // We expect a cache hit below.
locked = false
blobPath := path.Join(c.dir, c.FileLocation(kind, item.legacy, hash, item.size, item.random))
if !isSizeMismatch(size, item.size) {
var f *os.File
f, err = os.Open(blobPath)

I think we decided to release the lock before os.Open on purpose, to avoid holding the mutex while doing frequent filesystem syscalls. And we considered it OK that os.Open could fail if an old entry was evicted. But with the latest change also recently written files are deleted when replacing lru item with same key.

Right- good catch. There's also another bug which masks this one: I forgot to make SizedLRU.Add remove files when replacing items.

One solution to the first bug might be to try to open the file without holding the lock, and if the file does not exist anymore, try again while holding the lock.

For the second bug, we can either delete the old file while holding the lock, or spawn a goroutine to perform the deletion so we can continue and release the lock (possibly with a delay to increase the probability of hitting the fast path inside Get).

I'll think this through some more, maybe there's a better solution.

@mostynb
Copy link
Copy Markdown
Collaborator Author

mostynb commented Feb 17, 2021

Here's another potential solution, reference counting Gets that haven't opened the file yet:

  • When adding an item to the LRU, initialise a uint32 counter in the lruItem value to 1 (while holding the lock).
  • When the value is evicted from the LRU index, perform an atomic decrement and remove the file if the resulting uint32 is 0 (this would require some refactoring to perform the file removals without holding the lock - perhaps by spawning a new goroutine per file removal).
  • When a Get for that item occurs, do an atomic increment, and defer a function which performs an atomic decrement and then removes the file if the resulting uint32 is 0 (the lock is released after deferring).

This will overflow if there are 4,294 million concurrent Gets (that haven't opened the file yet) for the same value, but I don't think we'd scale that high anyway.

However, I think it would be better to start with the fallback slow path fix I mentioned the other day. ie if we fail to open the file, perform another LRU lookup and open the file while holding the lock (and remove files when replacing items in the index- also after refactoring, so it can be done without holding the lock). Maybe that will be fast enough.

mostynb added 28 commits March 1, 2021 07:55
The default is to use zstd, but some installs might choose to use
uncompressed data (for example if clients mostly upload blobs which
are already compressed).
Migrating old data files to the new format is too slow, we will need
to add back support for raw uncompressed files.
The on-disk format seems to be settled now, and migration should be as
fast as is reasonably possible. I no longer consider it a risky move to
migrate old caches.
…alue

This could put significant load on the backend server. Instead, just return
"unknown size" and rely on the SHA256 key by itself to avoid collisions.
By using a zstd skippable frame for our header, the file becomes
a valid .zst file on disk. This makes it possible to inspect cache
files with standard tools.

The cost is only 8 bytes per (compressed) file. Compressed CAS blobs
are now limited to ~511TB (with 1M chunks), which should not be an
issue in practical use.
By doing this we can build the LRU index on startup using only the
directory listings (ie without reading the header of compressed
blobs). This should keep startup times low.
While preexisting legacy/.v1 CAS blobs were being imported and used
correctly, newly uploaded blobs weren't being named correctly with
--storage_mode uncompressed.
Unlike the other settings, these were previously only settable
from a config file.
Since we have two different on-disk CAS blob formats, and we don't want
to perform transcoding in the proxy backends, we need to use a different
path for v2 CAS blobs.

This means that for now, it's not possible to use storage_mode: zstd and
use the httpproxy backend to proxy blobs to another bazel-remote instance.

Also, when switching storage modes, stored blobs for the other storage
mode on the proxy backends will not be used.
Rely on the sha256 hash alone, like we do in httpproxy.Contains.
This avoids the need to update this code since we added the zstandard
skippable framing, which I don't have time to debug right now.
…he LRU lock

By adding a psuedo-random string to the filename for blobs and
storing that string in the index, we no longer need to rename
files while holding the LRU index lock.

To avoid incomplete files from being reused after restarting
bazel-remote, we use a bit of a hack: create the blob file with
the setgid bit set, and un-set it once the blob is complete.
On startup, remove any files with the setgid bit set.
If there is an error while reading the input file, we should pass that
error on to readers so they know not to use incomplete files.
We can peek at the first part of the data to figure out the logical
size, then let the disk cache layer decide if it wants to use the
blob or not.
@mostynb mostynb force-pushed the compressed_blobs_support branch from 51a10d3 to 48e8649 Compare March 1, 2021 07:17
@mostynb mostynb merged commit b193c83 into buchgr:master Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants