
Add guide about content addressing folder #2262

Open
2color wants to merge 5 commits into main from content-addressing-folders

Conversation

@2color
Member

@2color 2color commented Mar 6, 2026

Describe your changes

This is a new guide comparing UnixFS, iroh collections, and DASL/MASL for content addressing directories of files, covering overhead, determinism, subsetting, and ecosystem support.

After spending a lot of time comparing the three for a different use case, I thought it would be useful to share these insights, and embrace the plurality of the IPFS ecosystem.

Checklist before merging

  • Passing all required checks (The beta Check Markdown links for modified files check is not required)

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

🚀 Build Preview on IPFS ready

@2color 2color force-pushed the content-addressing-folders branch from 41b88fb to 38077b0 on March 6, 2026 13:36

The goal is to have a single content hash that represents a directory of files, such that verifying that hash verifies the entire contents.

This matters for build outputs, software distributions, large datasets, and website archives: any case where you need to verify that a collection of files hasn't changed. A naive approach like hashing a tarball is fragile: tar archives encode metadata (timestamps, permissions, ordering) that varies between machines, producing different hashes for identical file contents. Content addressing solves this, but the choice of format has real consequences, particularly for overhead, determinism, language support and existing tooling, and whether you can fetch subsets without downloading the whole thing. These differences compound as dataset size grows: what's negligible at megabyte scale (a few extra bytes of framing, an extra round of parsing per block) becomes a meaningful cost at terabyte scale across millions of files.
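The fragility point above can be made concrete with a small sketch: hash only the sorted relative paths and file contents, so timestamps, permissions, and archive ordering can't change the result. This is a hypothetical illustration, not any of the formats compared in the guide; it uses stdlib SHA-256 where iroh would use BLAKE3.

```python
# Hypothetical sketch: a deterministic digest over a directory of files.
# Unlike hashing a tarball, this covers only paths and contents, so
# metadata like mtimes and permissions cannot perturb the hash.
import hashlib
import os

def directory_digest(root: str) -> str:
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix traversal order across machines
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, "rb") as f:
                file_hash = hashlib.sha256(f.read()).hexdigest()
            entries.append(f"{rel}\x00{file_hash}")
    manifest = "\n".join(entries).encode()
    return hashlib.sha256(manifest).hexdigest()
```

Touching a file's mtime or repacking the tree leaves this digest unchanged; only a path or content change moves it.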
Member Author

Add a note here that creating a tar isn't practical for large datasets where you cannot afford to store two copies.

@2color
Member Author

2color commented Mar 6, 2026

  • Add more references to other pages like the lifecycle
  • Add a whole opening section about the more abstract notion of a Merkle proof, with a link to ProtoSchool. Use the analogy of a club bouncer who holds the root hash, while each person carries proof of their name (a real-world ID) plus the Merkle path to the root.

@2color 2color requested review from lidel and vmx March 6, 2026 18:53
Comment on lines +126 to +128
│ "assets/style.css" │ each string is varint-length
│ "js/app.js" │ prefixed + raw UTF-8 bytes
│ "index.html" │
Member

The right edge is one space off. Same on the diagram below.

- **Flat representation of trees.** Directory structure lives in the name strings as relative paths, not as separate directory entries with their own metadata. One entry per file, no ambiguity about empty directories or nested paths.
- **Positional, tag-free encoding.** Postcard serializes fields in declaration order with no field numbers or type tags. The `"CollectionV0."` magic header handles versioning.
- **Compact.** The overhead per file is a varint-prefixed filename in the metadata blob and a 32-byte hash in the root blob.
- **O(1) file lookup.** The root blob is a flat array of fixed-size 32-byte hashes, so finding the Nth file is a constant-time offset (`N * 32` bytes) with no parsing required.
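The O(1) lookup above reduces to a byte-offset computation. A minimal sketch (a hypothetical illustration, not the iroh-blobs API):

```python
# Sketch of the O(1) lookup: the root blob is a flat concatenation of
# fixed-size 32-byte hashes, so the Nth file's hash sits at byte
# offset N * 32. No parsing or traversal required.
HASH_SIZE = 32

def nth_hash(root_blob: bytes, n: int) -> bytes:
    offset = n * HASH_SIZE
    return root_blob[offset:offset + HASH_SIZE]
```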
Member

When you already know at which offset the hash is. Practically you'd fetch the HashSeq, then the CollectMeta and scan through the strings to get the offset. It's still O(1), but it might be worth noting.
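The flow described here can be sketched as follows, assuming (as in the diagram above) that names are varint-length-prefixed UTF-8 strings: scan the metadata blob to find a path's index, then the hash offset in the root blob is simply index * 32. The function names are hypothetical.

```python
# Sketch: find a file's index by scanning length-prefixed name strings.
# The scan is linear in the number of entries; once the index is known,
# the hash lookup in the root blob is a constant-time offset.
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a LEB128-style varint; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7

def index_of(meta_blob: bytes, wanted: str) -> int:
    pos, index = 0, 0
    while pos < len(meta_blob):
        length, pos = read_varint(meta_blob, pos)
        name = meta_blob[pos:pos + length].decode("utf-8")
        pos += length
        if name == wanted:
            return index  # hash offset in root blob: index * 32
        index += 1
    raise KeyError(wanted)
```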

- **Streaming verification.** The root blob is a hash sequence, so a verifier can check individual files incrementally as they arrive.
- **Ready-made distribution.** Collections can be distributed in a peer-to-peer fashion with iroh-blobs.
- **BLAKE3.** Fast (parallelizable, SIMD-accelerated), 256-bit digests, and adopted by the [BDASL](https://dasl.ing/bdasl.html) spec.
- **File-level subsetting only.** Individual files can be fetched and verified by their BLAKE3 hash, but there is no way to address a subdirectory as a unit. Fetching a subset means filtering the path list and requesting files one by one.
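The streaming-verification bullet above can be sketched as checking each arriving file against its 32-byte entry in the root hash sequence. A hedged illustration with stdlib SHA-256 standing in for the BLAKE3 that iroh actually uses:

```python
# Sketch of streaming verification: as each file arrives, hash it and
# compare against the matching fixed-size entry in the root blob, so a
# verifier never has to buffer the whole collection before checking.
import hashlib

def verify_stream(root_blob: bytes, files):
    """Yield (index, ok) per file; files is an iterable of bytes."""
    for i, content in enumerate(files):
        expected = root_blob[i * 32:(i + 1) * 32]
        actual = hashlib.sha256(content).digest()
        yield i, actual == expected
```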
Member

This is kind of the same point as the "Flat representation of trees" above. Maybe those points can be combined.

I haven't checked, but couldn't a HashSeq also link to another HashSeq? If so, you could build directory structures where you can directly link to a sub-directory.

| Criteria | iroh collections | UnixFS | MASL/DRISL |
| -------------------- | ----------------------------------------------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------- |
| Encoding | Postcard (tag-free binary) | Protobuf (dag-pb) | [CBOR (deterministic subset with support for CIDs)](https://dasl.ing/drisl.html) |
| Hash | BLAKE3 | Configurable (SHA-256 default) | Configurable via CID multihash |
Member

DRISL only supports DASL-CIDs, hence only SHA-256.

| Identifiers | BLAKE3 hash (CID-encodable via `blake3` + `blake3_hashseq` multicodecs) | CID (self-describing) | CID (self-describing) |
| File lookup | O(1) offset from root | DAG traversal, depth varies | O(1) key lookup from root |
| Subsetting | Individual files only | Files and folders (subtree by CID) | Individual files only |
| Byte ranges | No (whole-file hashes) | Yes (chunked DAG allows partial reads) | No (whole-file CIDs) |
Member

It isn't clear to me what you mean by "Byte ranges". You can make range requests on the files; subsets are implicitly addressed thanks to BLAKE3.

