> The goal is to have a single content hash that represents a directory of files, such that verifying that hash verifies the entire contents.
>
> This matters for build outputs, software distributions, large datasets, website archives — any case where you need to verify that a collection of files hasn't changed. A naive approach like hashing a tarball is fragile: tar archives encode metadata (timestamps, permissions, ordering) that vary between machines, producing different hashes for identical file contents. Content addressing solves this, but the choice of format has real consequences — particularly for overhead, determinism, language support and existing tooling, and whether you can fetch subsets without downloading the whole thing. These differences compound as dataset size grows: what's negligible at megabyte scale — a few extra bytes of framing, an extra round of parsing per block — becomes a meaningful cost at terabyte scale across millions of files.
Add a note here that creating a tar isn't practical for large datasets, where you cannot afford to store two copies.
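To make the determinism point concrete, here is a rough sketch of the idea: hash each file's relative path and contents in sorted order, so the digest depends only on the data, not on timestamps or traversal order. This uses the stdlib's `blake2b` as a stand-in for BLAKE3 (which is not in the stdlib), and the layout is illustrative, not iroh's actual format.

```python
import hashlib
import os

def directory_digest(root: str) -> str:
    """Deterministic digest over a directory: sorted relative paths + contents.

    blake2b stands in for BLAKE3 here; the structure of the hash input,
    not the algorithm, is the point.
    """
    h = hashlib.blake2b(digest_size=32)
    entries = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full, root), full))
    for rel, full in sorted(entries):          # sorted => machine-independent order
        with open(full, "rb") as f:
            data = f.read()
        h.update(len(rel).to_bytes(8, "big"))  # length-prefix to avoid ambiguity
        h.update(rel.encode("utf-8"))
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()
```

Unlike a tarball hash, two machines producing the same file contents get the same digest here, regardless of mtimes or directory iteration order.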
> ```
> │ "assets/style.css" │  each string is varint-length
> │ "js/app.js"        │  prefixed + raw UTF-8 bytes
> │ "index.html"       │
> ```
The right edge is one space off. Same on the diagram below.
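For readers who haven't seen varint length prefixes, a minimal sketch of the scheme the diagram describes: an unsigned LEB128-style varint carrying the byte length, followed by the raw UTF-8 bytes. As far as I know, postcard's length encoding has this shape, but treat the exact details here as an assumption.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128-style varint: 7 bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)

def encode_name(name: str) -> bytes:
    """A filename entry: varint byte-length prefix + raw UTF-8."""
    raw = name.encode("utf-8")
    return encode_varint(len(raw)) + raw
```

Short names cost a single prefix byte; lengths of 128 and above spill into a second byte.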
> - **Flat representation of trees.** Directory structure lives in the name strings as relative paths, not as separate directory entries with their own metadata. One entry per file, no ambiguity about empty directories or nested paths.
> - **Positional, tag-free encoding.** Postcard serializes fields in declaration order with no field numbers or type tags. The `"CollectionV0."` magic header handles versioning.
> - **Compact.** The overhead per file is a varint-prefixed filename in the metadata blob and a 32-byte hash in the root blob.
> - **O(1) file lookup.** The root blob is a flat array of fixed-size 32-byte hashes, so finding the Nth file is a constant-time offset (`N * 32` bytes) with no parsing required.
That holds only when you already know at which offset the hash is. Practically you'd fetch the HashSeq, then the CollectMeta, and scan through the strings to get the offset. It's still O(1), but it might be worth noting.
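The two-step lookup described in the comment above can be sketched as follows. The blob layout here is illustrative only; the real root blob may carry additional entries (such as the metadata hash) before the per-file hashes.

```python
def lookup_hash(names: list[str], root_blob: bytes, path: str) -> bytes:
    """Find a file's 32-byte hash: a linear scan over the decoded name
    strings yields the index, then a constant-time offset into the flat
    hash array recovers the hash (no parsing of the root blob needed)."""
    i = names.index(path)                    # scan the metadata name list
    return root_blob[i * 32 : (i + 1) * 32]  # N * 32 byte offset
```

The scan over the names is linear, but the jump into the hash array itself stays O(1), which is the point the bullet is making.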
> - **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive.
> - **Ready-made distribution.** Collections can be distributed in a peer-to-peer fashion with iroh-blobs.
> - **BLAKE3.** Fast (parallelizable, SIMD-accelerated), 256-bit digests, and adopted by the [BDASL](https://dasl.ing/bdasl.html) spec.
> - **File-level subsetting only.** Individual files can be fetched and verified by their BLAKE3 hash, but there is no way to address a subdirectory as a unit. Fetching a subset means filtering the path list and requesting files one by one.
This is kind of the same point as the "Flat representation of trees" above. Maybe those points can be combined.
I haven't checked, but couldn't a HashSeq also link to another HashSeq? If yes, then you could build directory structures where you can directly link to a sub-directory.
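The "filter the path list" approach mentioned above can be sketched like this, assuming the decoded name list is already available. The helper name is hypothetical, not part of any iroh API.

```python
def subdirectory_entries(names: list[str], prefix: str) -> list[int]:
    """Indices of files under a subdirectory. There is no single hash for
    the subtree, so the caller must fetch each file's hash individually."""
    prefix = prefix.rstrip("/") + "/"
    return [i for i, n in enumerate(names) if n.startswith(prefix)]
```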
> | Criteria | iroh collections | UnixFS | MASL/DRISL |
> | -------- | ---------------- | ------ | ---------- |
> | Encoding | Postcard (tag-free binary) | Protobuf (dag-pb) | [CBOR (deterministic subset with support for CIDs)](https://dasl.ing/drisl.html) |
> | Hash | BLAKE3 | Configurable (SHA-256 default) | Configurable via CID multihash |
DRISL only supports DASL-CIDs, hence only SHA-256.
> | Identifiers | BLAKE3 hash (CID-encodable via `blake3` + `blake3_hashseq` multicodecs) | CID (self-describing) | CID (self-describing) |
> | File lookup | O(1) offset from root | DAG traversal, depth varies | O(1) key lookup from root |
> | Subsetting | Individual files only | Files and folders (subtree by CID) | Individual files only |
> | Byte ranges | No (whole-file hashes) | Yes (chunked DAG allows partial reads) | No (whole-file CIDs) |
It isn't clear to me what you mean by "Byte ranges". You can make range requests on the files; subsets are addressed implicitly due to BLAKE3.
Describe your changes
This is a new comparison guide of UnixFS, iroh collections, and DASL/MASL for content addressing directories of files, covering overhead, determinism, subsetting, and ecosystem support.
After spending a lot of time comparing the three for a different use case, I thought it would be useful to share these insights and embrace the plurality of the IPFS ecosystem.
Checklist before merging