Skip to content

Support external writes to Iceberg tables using Postgres catalog#247

Draft
sfc-gh-mslot wants to merge 16 commits into
mainfrom
marcoslot/external-writes
Draft

Support external writes to Iceberg tables using Postgres catalog#247
sfc-gh-mslot wants to merge 16 commits into
mainfrom
marcoslot/external-writes

Conversation

@sfc-gh-mslot
Copy link
Copy Markdown
Collaborator

Background project I've had going on, needs a deeper look.

We expose Iceberg tables via the iceberg_tables view, which is supported by Spark and pyiceberg. However, we currently only support reads from that view. This PR adds support for writes to existing tables (i.e. updates on the view), by reading the metadata file being inserted, updating our internal metadata tables from it, and adjusting the schema if needed.

* Queue old files that are no longer referenced for deletion. Compare
* oldFileHash with newFileHash to find unreferenced files.
*/
if (oldFileHash != NULL && newFileHash != NULL)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks questionable, since we need to consider which snapshots are still held in the new metadata

Implements functionality to allow external Iceberg clients (Spark,
PyIceberg) to write data and metadata, then update the pg_lake catalog
by modifying iceberg_tables.metadata_location.

Key changes:
- Modified external_catalog_modification trigger to handle UPDATE on
  internal catalog tables, enabling external writes with optimistic
  concurrency control via previous_metadata_location
- Added sync_iceberg_metadata_from_external_write() function to sync
  pg_lake catalog state (schema, partition specs, data files) from
  externally-written metadata
- Introduced SkipIcebergDDLProcessing flag to prevent duplicate field_id
  registration when syncing external schema changes
- Made AddDataFileToTable() non-static for use by sync logic
- Added comprehensive test suite covering basic writes, schema changes,
  and concurrency control

External clients can now:
1. Write new data/metadata files to object storage
2. UPDATE iceberg_tables SET metadata_location = <new>,
   previous_metadata_location = <old>
3. pg_lake automatically syncs internal catalog from the new metadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
@sfc-gh-mslot sfc-gh-mslot force-pushed the marcoslot/external-writes branch from 6468828 to c8b21a1 Compare May 15, 2026 11:10
sfc-gh-mslot and others added 13 commits May 15, 2026 11:15
When external Iceberg clients write new metadata, old data files that
are no longer referenced need to be queued for deletion, similar to
how compaction handles file cleanup.

Key changes:
- Track old file paths before clearing the catalog during sync
- Track new file paths as we repopulate from external metadata
- Compare old vs new file sets to identify unreferenced files
- Queue unreferenced files for deletion using InsertDeletionQueueRecord
- Handle empty table case where all old files become unreferenced

This ensures external writes properly clean up replaced files through
the deletion queue, which VACUUM will eventually remove from object
storage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
Use GetTransactionSnapshot() instead of GetActiveSnapshot() when
retrieving old data files for deletion queue processing.

GetActiveSnapshot() can fail if called outside of a snapshot context,
which was causing server crashes when the sync function was called
from the trigger. GetTransactionSnapshot() is safer as it creates
a snapshot if one doesn't exist.

This fixes server crashes during external writes from PyIceberg/Spark.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
Call sync_iceberg_metadata_from_external_write directly as a C function
instead of via SPI_EXECUTE to avoid nested SPI connections.

The sync function uses SPI internally (via GetTableDataFilesFromCatalog
and other catalog functions), so calling it through SPI_EXECUTE creates
nested SPI calls which can cause crashes or undefined behavior.

Using DirectFunctionCall1 instead allows proper SPI nesting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
Fixed two critical issues:

1. Hash table creation: Added HASH_STRINGS flag to hash_create calls
   in sync_external_metadata.c. Hash tables using string keys must
   specify HASH_STRINGS or they will fail with assertion errors.

2. Cross-extension function calls: Use OidFunctionCall1 with
   LookupFuncName to call sync_iceberg_metadata_from_external_write
   from pg_lake_iceberg. This avoids undefined symbol errors since
   the function is in a different shared library (pg_lake_table.so).

All 6 external write tests now pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
Added 3 comprehensive tests to verify that unreferenced data files are
properly queued for deletion when external Iceberg clients write new
metadata:

1. test_external_write_deletion_queue_on_overwrite: Verifies that when
   PyIceberg overwrites a table, all old data files are added to the
   deletion queue with proper orphaned_at timestamps.

2. test_external_write_deletion_queue_on_empty: Verifies that when
   PyIceberg overwrites a table with empty data, ALL old data files
   are queued for deletion.

3. test_external_write_deletion_queue_only_old_files: Verifies that
   only old files (not new files) are queued, and that append operations
   don't queue any files for deletion.

All tests query lake_engine.deletion_queue to confirm proper file
lifecycle management. Tests verify:
- Correct file paths are queued
- orphaned_at timestamps are set
- New files are NOT queued
- Append operations don't trigger deletions

All 9 external write tests now pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
Changed LookupFuncName call to use a separate argTypes array variable
instead of an inline compound literal (Oid[]){REGCLASSOID}.

This improves code readability and makes the indentation tool happy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
Include utils/snapmgr.h to declare GetTransactionSnapshot() function.

This fixes the CI build error:
  implicit declaration of function 'GetTransactionSnapshot'

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
Fixed two issues with the external_catalog_modification trigger:

1. Changed error messages to match test expectations:
   - "modifying the internal catalog is currently only supported via pg_lake_iceberg tables"
   - This matches the expected error format in test_iceberg_catalog.py

2. Added validation to detect catalog_name changes during UPDATE:
   - Prevents changing a table from external catalog to internal catalog
   - Prevents changing a table from internal catalog to external catalog
   - These operations should only be done via CREATE/DROP TABLE

Fixes CI test failures in:
- test_create_in_internal_catalog
- test_create_in_external_catalog

All external write tests still pass (9/9).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Marco Slot <marco.slot@snowflake.com>
GetIcebergCatalogMetadataLocation is static in catalog.c; the trigger
must use the exported sibling, which dispatches to the same internal
helper for same-database internal-catalog UPDATEs.

Signed-off-by: Marco Slot <marco.slot@snowflake.com>
The sync_iceberg_metadata_from_external_write CREATE FUNCTION was
landed in pg_lake_table--3.2--3.3.sql, but 3.3 has already shipped
(latest tag is v3.3.4). 3.4 is unreleased so its upgrade script is
fair game; move the CREATE FUNCTION there.

Signed-off-by: Marco Slot <marco.slot@snowflake.com>
SyncDataFilesFromMetadata used to walk only the current snapshot's
manifests and queue every other previously-cataloged file for deletion.
Iceberg metadata typically retains historic snapshots, and external
clients may rely on them for time-travel reads — physically deleting
those files (which the deletion queue does after a 10-day retention
window) would corrupt those reads.

Instead, build the set of "still referenced" files from ALL retained
snapshots in the new metadata via IcebergFindAllReferencedFiles, and
queue old cataloged files only when they're absent from that set.

The catalog itself is still rebuilt from the current snapshot only, since
pg_lake doesn't expose time-travel reads — only the deletion-queue
half of the sync changes.

Signed-off-by: Marco Slot <marco.slot@snowflake.com>
ProcessAlterTable already short-circuits when SkipIcebergDDLProcessing
is set, but rename goes through a different post-process path. Without
this guard, the external-write sync's RENAME COLUMN would trigger a
duplicate metadata write on top of the externally-supplied metadata.

Signed-off-by: Marco Slot <marco.slot@snowflake.com>
…eue tests

Three new tests:
- test_external_write_preserves_history_files: PyIceberg append after
  pg_lake insert. Old data files must NOT be queued for deletion (still
  referenced by the retained pre-append snapshot).
- test_external_write_partition_spec_evolution: PyIceberg adds an
  identity partition; new spec must register and reads must work
  across both specs.
- test_external_write_rename_column: PyIceberg renames a column;
  field IDs are stable so the sync must rename in pg_class, not
  silently drop+add.

The three deletion-queue tests are updated to assert the corrected
semantic: overwrite/empty-overwrite/append+overwrite all retain the
prior snapshot(s), so files stay referenced and are NOT queued for
deletion. (Physical-deletion-of-orphans coverage requires snapshot
expiration, which pyiceberg's Python API does not expose; that path
is exercised in pg_lake's internal tests already.)

Signed-off-by: Marco Slot <marco.slot@snowflake.com>
@sfc-gh-mslot sfc-gh-mslot force-pushed the marcoslot/external-writes branch from c8b21a1 to 0958ea0 Compare May 15, 2026 11:15
CI's lint-check-18 caught comment-reflow and continuation-indent issues
in the trigger handler, the sync, and the new pytest. Run `make reindent`
to bring them into line with the project's pgindent/black formatting.
No behavior change.

Signed-off-by: Marco Slot <marco.slot@snowflake.com>
…rnal-write tests

Adds 8 tests around the external-write path: INSERT/DELETE/catalog-flip/
unknown-table rejection in the iceberg_tables trigger, column stats
populated by sync, PyIceberg delete() through the position-deletes
branch, list-typed columns surviving the data-file resync, and a
schema-then-data sequence on the same table.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant