WSSink: evolve the automatic schema as new data pours in (#70)
Draft
jpc wants to merge 1 commit into
This would simplify wsds/WSSink usage for one-off datasets (evals, synthetic data).
Currently we depend only on the first batch of data to determine the schema. When we detect new columns, we silently ignore them (sic!), and when there are datatype changes we error out.
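To make the limitation concrete, here is a minimal plain-PyArrow sketch (not the actual WSSink code path) of the kind of mismatch involved:

```python
import pyarrow as pa

# The first batch fixes the schema: "score" is inferred as int64.
first = pa.Table.from_pylist([{"id": 1, "score": 3}])
schema = first.schema

# A later batch adds a column and changes a dtype. Against the frozen
# first-batch schema, "notes" has nowhere to go and the float "score"
# no longer matches int64.
later = pa.Table.from_pylist([{"id": 2, "score": 3.5, "notes": "ok"}])
later.cast(schema)  # raises instead of evolving the schema
```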
If we come up with a set of rules for automatic casting (PyArrow's defaults do work, but are not great in many cases) and for other edge cases, we can rewrite the data written so far with the new schema and avoid most conflicts.
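A rough sketch of what that evolution step could look like with plain PyArrow; `evolve` and `conform` are hypothetical helpers, not existing WSSink API, and `promote_options` needs a reasonably recent PyArrow:

```python
import pyarrow as pa

def evolve(old_table: pa.Table, new_batch: pa.Table) -> tuple[pa.Table, pa.Table]:
    """Widen the schema to cover both tables and cast each one to it.

    promote_options="permissive" lets PyArrow pick a common type
    (e.g. int64 + float64 -> float64) instead of erroring on mismatches.
    """
    unified = pa.unify_schemas(
        [old_table.schema, new_batch.schema], promote_options="permissive"
    )

    def conform(t: pa.Table) -> pa.Table:
        # Fill columns missing from this table with nulls, then cast
        # everything to the unified schema in a consistent column order.
        for field in unified:
            if field.name not in t.column_names:
                t = t.append_column(field, pa.nulls(len(t), field.type))
        return t.select(unified.names).cast(unified)

    return conform(old_table), conform(new_batch)
```

The already-written shards would then be rewritten with the result of `conform(old_table)` whenever the unified schema differs from the one on disk.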
One especially nice improvement is automatic support for `large_binary` columns without an explicit declaration.
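For instance, a sketch of schema-level promotion, assuming the rule is simply to always widen 32-bit-offset binary and string columns (the exact policy would be part of the casting rules above):

```python
import pyarrow as pa

def promote_binary(schema: pa.Schema) -> pa.Schema:
    """Swap binary/string fields for their large_* variants so a single
    column is not limited by 32-bit offsets (~2 GiB per chunk)."""
    fields = []
    for field in schema:
        if pa.types.is_binary(field.type):
            field = field.with_type(pa.large_binary())
        elif pa.types.is_string(field.type):
            field = field.with_type(pa.large_string())
        fields.append(field)
    return pa.schema(fields, metadata=schema.metadata)
```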