> The goal is to have a single content hash that represents a directory of files, such that verifying that hash verifies the entire contents.
>
> This matters for build outputs, software distributions, large datasets, website archives — any case where you need to verify that a collection of files hasn't changed. A naive approach like hashing a tarball is fragile: tar archives encode metadata (timestamps, permissions, ordering) that vary between machines, producing different hashes for identical file contents. Content addressing solves this, but the choice of format has real consequences — particularly for overhead, determinism, language support and existing tooling, and whether you can fetch subsets without downloading the whole thing. These differences compound as dataset size grows: what's negligible at megabyte scale — a few extra bytes of framing, an extra round of parsing per block — becomes a meaningful cost at terabyte scale across millions of files.
Add a note here that creating a tar isn't practical for large datasets, where you cannot afford to store two copies.
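To make the determinism point concrete, here is a rough sketch of the idea: hash each file's relative path and contents in sorted order, so the digest depends only on the data, not on timestamps or traversal order. This uses the stdlib's `blake2b` as a stand-in for BLAKE3 (which is not in the stdlib), and the layout is illustrative, not iroh's actual format.

```python
import hashlib
import os

def directory_digest(root: str) -> str:
    """Deterministic digest over a directory: sorted relative paths + contents.

    blake2b stands in for BLAKE3 here; the structure of the hash input,
    not the algorithm, is the point.
    """
    h = hashlib.blake2b(digest_size=32)
    entries = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full, root), full))
    for rel, full in sorted(entries):          # sorted => machine-independent order
        with open(full, "rb") as f:
            data = f.read()
        h.update(len(rel).to_bytes(8, "big"))  # length-prefix to avoid ambiguity
        h.update(rel.encode("utf-8"))
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()
```

Unlike a tarball hash, two machines producing the same file contents get the same digest here, regardless of mtimes or directory iteration order.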
> ```
> │ "assets/style.css" │  each string is varint-length
> │ "js/app.js"        │  prefixed + raw UTF-8 bytes
> │ "index.html"       │
> ```
The right edge is one space off. Same on the diagram below.
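For readers who haven't seen varint length prefixes, a minimal sketch of the scheme the diagram describes: an unsigned LEB128-style varint carrying the byte length, followed by the raw UTF-8 bytes. As far as I know, postcard's length encoding has this shape, but treat the exact details here as an assumption.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128-style varint: 7 bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)

def encode_name(name: str) -> bytes:
    """A filename entry: varint byte-length prefix + raw UTF-8."""
    raw = name.encode("utf-8")
    return encode_varint(len(raw)) + raw
```

Short names cost a single prefix byte; lengths of 128 and above spill into a second byte.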
> - **Flat representation of trees.** Directory structure lives in the name strings as relative paths, not as separate directory entries with their own metadata. One entry per file, no ambiguity about empty directories or nested paths.
> - **Positional, tag-free encoding.** Postcard serializes fields in declaration order with no field numbers or type tags. The `"CollectionV0."` magic header handles versioning.
> - **Compact.** The overhead per file is a varint-prefixed filename in the metadata blob and a 32-byte hash in the root blob.
> - **O(1) file lookup.** The root blob is a flat array of fixed-size 32-byte hashes, so finding the Nth file is a constant-time offset (`N * 32` bytes) with no parsing required.
That holds only when you already know at which offset the hash is. Practically you'd fetch the HashSeq, then the CollectMeta, and scan through the strings to get the offset. It's still O(1), but it might be worth noting.
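The two-step lookup described in the comment above can be sketched as follows. The blob layout here is illustrative only; the real root blob may carry additional entries (such as the metadata hash) before the per-file hashes.

```python
def lookup_hash(names: list[str], root_blob: bytes, path: str) -> bytes:
    """Find a file's 32-byte hash: a linear scan over the decoded name
    strings yields the index, then a constant-time offset into the flat
    hash array recovers the hash (no parsing of the root blob needed)."""
    i = names.index(path)                    # scan the metadata name list
    return root_blob[i * 32 : (i + 1) * 32]  # N * 32 byte offset
```

The scan over the names is linear, but the jump into the hash array itself stays O(1), which is the point the bullet is making.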
> - **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive.
> - **Ready-made distribution.** Collections can be distributed in a peer-to-peer fashion with iroh-blobs.
> - **BLAKE3.** Fast (parallelizable, SIMD-accelerated), 256-bit digests, and adopted by the [BDASL](https://dasl.ing/bdasl.html) spec.
> - **File-level subsetting only.** Individual files can be fetched and verified by their BLAKE3 hash, but there is no way to address a subdirectory as a unit. Fetching a subset means filtering the path list and requesting files one by one.
This is kind of the same point as the "Flat representation of trees" above. Maybe those points can be combined.
I haven't checked, but couldn't a HashSeq also link to another HashSeq? If yes, then you could build directory structures where you can directly link to a sub-directory.
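The "filter the path list" approach mentioned above can be sketched like this, assuming the decoded name list is already available. The helper name is hypothetical, not part of any iroh API.

```python
def subdirectory_entries(names: list[str], prefix: str) -> list[int]:
    """Indices of files under a subdirectory. There is no single hash for
    the subtree, so the caller must fetch each file's hash individually."""
    prefix = prefix.rstrip("/") + "/"
    return [i for i, n in enumerate(names) if n.startswith(prefix)]
```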
> | Criteria | iroh collections | UnixFS | MASL/DRISL |
> | -------- | ---------------- | ------ | ---------- |
> | Encoding | Postcard (tag-free binary) | Protobuf (dag-pb) | [CBOR (deterministic subset with support for CIDs)](https://dasl.ing/drisl.html) |
> | Hash | BLAKE3 | Configurable (SHA-256 default) | Configurable via CID multihash |
DRISL only supports DASL-CIDs, hence only SHA-256.
> | Identifiers | BLAKE3 hash (CID-encodable via `blake3` + `blake3_hashseq` multicodecs) | CID (self-describing) | CID (self-describing) |
> | File lookup | O(1) offset from root | DAG traversal, depth varies | O(1) key lookup from root |
> | Subsetting | Individual files only | Files and folders (subtree by CID) | Individual files only |
> | Byte ranges | No (whole-file hashes) | Yes (chunked DAG allows partial reads) | No (whole-file CIDs) |
It isn't clear to me what you mean by "Byte ranges". You can make range requests on the files; subsets are addressed implicitly due to BLAKE3.
Describe your changes
This is a new comparison guide of UnixFS, iroh collections, and DASL/MASL for content addressing directories of files, covering overhead, determinism, subsetting, and ecosystem support.
After spending a lot of time comparing the three for a different use case, I thought it would be useful to share these insights and embrace the plurality of the IPFS ecosystem.
Checklist before merging