WSSink: evolve the automatic schema as new data pours in #70

Draft
jpc wants to merge 1 commit into jpc/async-pupyarrow from jpc/wssink-schema-evolution

@jpc jpc commented Apr 14, 2026

This would simplify wsds/WSSink usage for one-off datasets (evals, synthetic data).

Currently the schema is determined from the first batch of data alone. When new columns show up later we silently ignore them (sic!), and when a column's datatype changes we error out.

If we come up with a set of rules for automatic casting (PyArrow defaults do work but are not great in many cases) and other edge cases, we can rewrite the data written so far using a new schema and avoid most conflicts.
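To make the idea of a casting rule set concrete, here is a minimal sketch of how two schemas might be merged. It is not the code in this PR: the type names are plain strings standing in for PyArrow types, and `NUMERIC_LADDER`, `WIDENINGS`, `merge_type`, `merge_schema` and `needs_rewrite` are hypothetical names chosen for illustration. The promotion rules themselves are assumptions and would need tuning (for example, int64 to float64 loses precision for large integers).

```python
# Hypothetical rule set; type names are strings standing in for PyArrow types.
# Each type in the ladder can be promoted to any type to its right.
NUMERIC_LADDER = ["int32", "int64", "float64"]

# Pairs where the small-offset type is widened to its large-offset variant.
WIDENINGS = {"string": "large_string", "binary": "large_binary"}


def merge_type(old, new):
    """Return a type able to hold values of both `old` and `new`."""
    if old == new:
        return old
    # An all-null column carries no type information yet.
    if old == "null":
        return new
    if new == "null":
        return old
    if old in NUMERIC_LADDER and new in NUMERIC_LADDER:
        return max(old, new, key=NUMERIC_LADDER.index)
    if WIDENINGS.get(old) == new:
        return new
    if WIDENINGS.get(new) == old:
        return old
    raise TypeError(f"no casting rule for {old!r} and {new!r}")


def merge_schema(current, incoming):
    """Merge an incoming batch schema into the current one.

    New columns are appended instead of being dropped; conflicting
    column types are resolved with merge_type.
    """
    merged = dict(current)
    for name, typ in incoming.items():
        merged[name] = merge_type(merged[name], typ) if name in merged else typ
    return merged


def needs_rewrite(current, merged):
    """Data written so far must be rewritten if the schema changed."""
    return merged != current
```

With rules like these, a batch that adds a column or widens a type yields a new schema, and `needs_rewrite` signals that the data written so far has to be recast to it. Only a pair with no rule (say `string` against `int64`) would still raise.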

One especially nice improvement is automatic support for large binary without an explicit declaration.
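For context, Arrow's `binary` type uses 32-bit offsets, so a single array can hold at most 2^31 - 1 bytes of values, while `large_binary` uses 64-bit offsets. One way the automatic choice could be made is sketched below. This is an assumption about the approach, not the PR's implementation; `pick_binary_type` and `BINARY_OFFSET_LIMIT` are hypothetical names, and the types are again represented as strings.

```python
# Largest total payload a 32-bit-offset binary array can address.
BINARY_OFFSET_LIMIT = 2**31 - 1


def pick_binary_type(values, limit=BINARY_OFFSET_LIMIT):
    """Pick "binary" or "large_binary" for one column of a batch.

    `values` is an iterable of bytes objects or None. The column is
    promoted as soon as the running total of bytes exceeds `limit`.
    """
    total = 0
    for value in values:
        if value is None:
            continue  # nulls occupy no space in the values buffer
        total += len(value)
        if total > limit:
            return "large_binary"
    return "binary"
```

The `limit` parameter exists only so the check can be exercised with small inputs; in practice the default would apply.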
