Validate key mismatches at the write-side #57

Closed
shahbaz-humeai wants to merge 3 commits into main from shahbaz/wsds-validated-writes

Conversation

@shahbaz-humeai
Contributor

No description provided.

Enforce the hard invariant (all artifacts must share the same __key__
values) at write time, not just read time. When reference_keys is
provided to WSSink, each sample's __key__ is validated against the
expected key at that offset. Raises KeyMismatchError (per-sample) or
SampleCountMismatchError (on close) as fatal BaseExceptions that
prevent corrupt shards from being written.
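
As a rough illustration of the per-sample and on-close checks described above, here is a minimal sketch. `ValidatingSink` is a hypothetical in-memory stand-in, not the real WSSink, and the choice to detect surplus samples only on close is an assumption; only the exception names and the `reference_keys` parameter come from the commit message.

```python
from typing import Any, Dict, List, Sequence


class KeyMismatchError(BaseException):
    """A sample's __key__ differs from the expected key at its offset."""


class SampleCountMismatchError(BaseException):
    """The number of written samples differs from the number of reference keys."""


class ValidatingSink:
    """Hypothetical stand-in for WSSink; it buffers samples in memory only."""

    def __init__(self, reference_keys: Sequence[str]) -> None:
        self._expected: List[str] = list(reference_keys)
        self._samples: List[Dict[str, Any]] = []

    def write(self, sample: Dict[str, Any]) -> None:
        offset = len(self._samples)
        # Per-sample check: compare against the reference key at the same offset.
        if offset < len(self._expected):
            expected = self._expected[offset]
            actual = sample.get("__key__")
            if actual != expected:
                raise KeyMismatchError(
                    f"offset {offset}: expected {expected!r}, got {actual!r}"
                )
        self._samples.append(sample)

    def close(self) -> List[Dict[str, Any]]:
        # Count check (assumption: surplus samples are also caught here).
        if len(self._samples) != len(self._expected):
            raise SampleCountMismatchError(
                f"wrote {len(self._samples)} samples, expected {len(self._expected)}"
            )
        return self._samples
```

Because both errors derive from `BaseException`, a broad `except Exception:` in calling code will not swallow them, which is presumably what makes them fatal.
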

Replace the reference_keys parameter on WSSink with validate_keys=True,
which mirrors the read side: it automatically finds the smallest sibling
shard (by file size, skipping .wsds-link/.wsds-computed), reads its
__key__ column, and validates each written sample against it.

This matches how list_all_columns() sorts __key__ sources by ascending
file size to avoid loading heavy artifacts like audio.
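
The sibling selection could look roughly like the sketch below. The function name `pick_key_source` and the idea of passing in an explicit list of candidate paths are assumptions made for illustration; how the real code discovers sibling shards on disk is not shown in this PR text.

```python
from pathlib import Path
from typing import Iterable, Optional

# Suffixes named in the commit message as not being usable key sources.
SKIPPED_SUFFIXES = (".wsds-link", ".wsds-computed")


def pick_key_source(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the smallest regular file among candidates, or None if there is none.

    Hypothetical helper: link and computed shards are skipped, then the
    remaining files are compared by size in bytes.
    """
    usable = [
        p
        for p in candidates
        if p.is_file() and not p.name.endswith(SKIPPED_SUFFIXES)
    ]
    if not usable:
        return None
    return min(usable, key=lambda p: p.stat().st_size)
```

Choosing the smallest file keeps validation cheap: reading the `__key__` column from a small text or metadata shard avoids opening a much larger audio shard just to compare keys.
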

Merge main's WSSink improvements (schema parameter, encode_value in
write(), better error truncation) with our write-time key validation.
Key validation runs before encode_value so it checks the raw sample.
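
The ordering in the last sentence can be sketched as follows. Everything here is a hypothetical stand-in: this `encode_value` simply turns strings into bytes, and `write_sample` is not the real `WSSink.write()`. The point is only that a mismatched sample is rejected before any encoding work happens.

```python
from typing import Any, Dict, List


class KeyMismatchError(BaseException):
    """A sample's __key__ differs from the expected key."""


def encode_value(value: Any) -> Any:
    # Stand-in encoder (assumption): strings become UTF-8 bytes, the rest pass through.
    return value.encode("utf-8") if isinstance(value, str) else value


def write_sample(
    sample: Dict[str, Any], expected_key: str, out: List[Dict[str, Any]]
) -> None:
    # 1. Validate against the raw, unencoded sample.
    if sample.get("__key__") != expected_key:
        raise KeyMismatchError(
            f"expected {expected_key!r}, got {sample.get('__key__')!r}"
        )
    # 2. Only then encode the values and append the result.
    out.append({name: encode_value(value) for name, value in sample.items()})
```
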