feat: N-dimensional raster type extension (Phase 1)#749
feat: N-dimensional raster type extension (Phase 1)#749james-willis wants to merge 15 commits intoapache:mainfrom
Conversation
Replace the legacy 2D raster schema with the N-dimensional layout: - Remove metadata sub-struct (width/height/upperleft/scale/skew) - Add transform: List<Float64> (6-element GDAL GeoTransform) - Add x_dim/y_dim: Utf8 for explicit spatial dimension declaration - Flatten band struct: name, dim_names, shape, data_type, nodata, strides, offset, outdb_uri, data - Remove StorageType enum (OutDb indicated by non-null outdb_uri) - Remove metadata_indices/band_metadata_indices modules - Update raster_indices and band_indices for new layout Downstream crates will not compile until subsequent commits update them to use the new schema.
Replace the legacy trait hierarchy with N-D raster types: - traits.rs: RasterRef (transform, x_dim, y_dim, width/height derived from band dims), BandRef (ndim, dim_names, shape, nd_buffer, contiguous_data returning Cow), NdBuffer struct - array.rs: RasterStructArray reads new schema (transform list, x_dim/y_dim, flattened band fields with nested lists for dim_names/shape/strides), RasterRefImpl with band_boxed() - builder.rs: start_raster/start_band with N-D params, plus start_raster_2d/start_band_2d convenience for legacy 2D usage - affine_transformation.rs: AffineMatrix::from_transform(&[f64]) replaces from_metadata(), free functions accept &dyn RasterRef - display.rs: updated for new trait interface Downstream crates will not compile until test utilities and RS_* functions are updated in subsequent commits.
… test utils Update sedona-raster core types and sedona-testing helpers for the N-D raster schema: sedona-raster: - traits.rs: RasterRef with transform/x_dim/y_dim/width/height, BandRef with ndim/dim_names/shape/nd_buffer/contiguous_data(Cow), NdBuffer struct. band() returns Box<dyn BandRef>. - array.rs: RasterStructArray reads new flattened schema with nested lists for dim_names/shape/strides - builder.rs: start_raster/start_band with N-D params, plus start_raster_2d/start_band_2d convenience methods - affine_transformation.rs: from_transform(&[f64]) replaces from_metadata(), free functions accept &dyn RasterRef - display.rs: updated for new trait interface sedona-testing: - All raster helpers updated to use new builder API - assert_raster_equal compares transforms, dims, shapes, data - generate_multi_band_raster uses start_band_2d Also fixes x_dim/y_dim schema to use Utf8View (matching builder). sedona-raster-functions not yet updated — next commit.
Mechanically migrate all 14 RS_* function files from the legacy raster API to the new N-D trait interface: - raster.metadata().width/height → raster.width/height().unwrap() - raster.metadata().upper_left_x/scale_x/etc → raster.transform()[i] - raster.bands().len/band(n) → raster.num_bands/band(n-1) - band.metadata().data_type/nodata/storage_type → band.data_type/nodata/outdb_uri - band.data() → band.contiguous_data() - AffineMatrix::from_metadata → from_transform - Remove StorageType, RasterMetadata, BandMetadata imports - Update all test helpers to use start_raster_2d/start_band_2d All 140 existing tests pass with identical outputs.
Add 7 tests covering new N-dimensional raster capabilities: - Non-standard spatial dim names (lon/lat): width()/height() work - Mixed dimensionality: 3D + 2D bands in same raster - dim_index()/dim_size() lookups including missing dims - contiguous_data() returns Cow::Borrowed in Phase 1 - NdBuffer strides correct for UInt8/Float64/UInt16 at various shapes - width()/height() returns None for raster with no bands - Band name nullable: named vs unnamed bands, out-of-range
Move the u32-to-BandDataType conversion from inline match in the array reader to BandDataType::try_from_u32() on the enum itself. Eliminates duplicated mapping logic.
- All 10 valid discriminants map correctly - Invalid values (0, 11, u32::MAX) return None - Round-trip: discriminant as u32 → try_from_u32 for all variants
- Fix needless borrows flagged by clippy - Suppress too_many_arguments on start_raster_2d (intentional API) - cargo fmt across all modified crates
Add outdb_uri parameter to RasterBuilder::start_band() so OutDb bands can be constructed. Restore the RS_BandPath outdb tests that were weakened during the migration — they now properly test the Some(uri) code path again.
Update test outdb_uri values to follow the design convention: geotiff://s3://bucket/file.tif#band=N
- Replace unwrap() on width()/height() with proper error handling in rs_convexhull, rs_envelope, rs_size, rs_spatial_predicates - Remove dead band_name_array field from BandRefImpl - Add debug_assert! bounds checks on AffineMatrix::from_transform and RasterRefImpl::transform() - Add finish_band() validation: exactly one data value per band - Add start_band_2d() validation: reject when width/height are 0 - Add band_by_name() default method on RasterRef for Zarr workflows
2b969d1 to
66b3c2e
Compare
Add parse_outdb_uri() utility that splits scheme://path#fragment into components. RS_BandPath now strips the internal scheme prefix and fragment, returning just the path portion to users — matching Sedona Spark's behavior where RS_BandPath returns a plain file path. Test outdb_uri values now use scheme-dispatched format (geotiff://s3://bucket/file.tif#band=1) while RS_BandPath output remains s3://bucket/file.tif.
paleolimbot
left a comment
There was a problem hiding this comment.
I'll take a closer look at the implementation tomorrow, but wanted to get a preliminary review of the schema out!
| Field::new(column::METADATA, Self::metadata_type(), false), | ||
| Field::new(column::CRS, Self::crs_type(), true), // Optional: may be inferred from data | ||
| Field::new(column::CRS, Self::crs_type(), true), | ||
| Field::new(column::TRANSFORM, Self::transform_type(), false), | ||
| Field::new(column::X_DIM, DataType::Utf8View, false), | ||
| Field::new(column::Y_DIM, DataType::Utf8View, false), | ||
| Field::new(column::BANDS, Self::bands_type(), true), | ||
| ]) | ||
| } | ||
|
|
||
| /// Raster metadata schema | ||
| pub fn metadata_type() -> DataType { | ||
| DataType::Struct(Fields::from(vec![ | ||
| // Raster dimensions | ||
| Field::new(column::WIDTH, DataType::UInt64, false), | ||
| Field::new(column::HEIGHT, DataType::UInt64, false), | ||
| // Geospatial transformation parameters | ||
| Field::new(column::UPPERLEFT_X, DataType::Float64, false), | ||
| Field::new(column::UPPERLEFT_Y, DataType::Float64, false), | ||
| Field::new(column::SCALE_X, DataType::Float64, false), | ||
| Field::new(column::SCALE_Y, DataType::Float64, false), | ||
| Field::new(column::SKEW_X, DataType::Float64, false), | ||
| Field::new(column::SKEW_Y, DataType::Float64, false), | ||
| ])) | ||
| /// Affine transform schema — 6-element GDAL GeoTransform: | ||
| /// `[origin_x, scale_x, skew_x, origin_y, skew_y, scale_y]` | ||
| pub fn transform_type() -> DataType { | ||
| DataType::List(FieldRef::new(Field::new("item", DataType::Float64, false))) | ||
| } |
There was a problem hiding this comment.
Just highlighting for myself + others that this is the main change.
The main conceptual change here seems to be that the height and width in 2D space are now members of the bands. Even though it might involve some duplication, I wonder if a top-level shape (whose ordering is always x, y, and possibly z in some future) would still be appropriate. The bands would then all be parsed according to the x/y dim name and validated against the top-level declaration.
X_DIM + Y_DIM should probably also be a list to allow for future Z. Lists are kind of a pain in Rust but interacting with these in Python is fairly easy to do and this PR is probably the only one that needs to consider that detail.
There was a problem hiding this comment.
in zarr, each variable (here band) can have any permutation of dimensions. so band 1 might be xyzt, 2 might be yzt, 3 might be xy
| /// Individual band schema — flattened N-D band with dimension metadata | ||
| pub fn band_type() -> DataType { | ||
| DataType::Struct(Fields::from(vec![ | ||
| Field::new(column::METADATA, Self::band_metadata_type(), false), | ||
| Field::new(column::DATA, Self::band_data_type(), false), | ||
| Field::new(column::NAME, DataType::Utf8, true), | ||
| Field::new(column::DIM_NAMES, Self::dim_names_type(), false), | ||
| Field::new(column::SHAPE, Self::shape_type(), false), | ||
| Field::new(column::DATATYPE, DataType::UInt32, false), | ||
| Field::new(column::NODATA, DataType::Binary, true), | ||
| Field::new(column::STRIDES, Self::strides_type(), false), | ||
| Field::new(column::OFFSET, DataType::UInt64, false), | ||
| Field::new(column::OUTDB_URI, DataType::Utf8, true), | ||
| Field::new(column::DATA, DataType::BinaryView, false), |
There was a problem hiding this comment.
Highlighting changes here to check my understanding...the main change is that bands have their own dimensions, with the addition of dim_names. The spatial dimensions of all the bands have to match. Here outdb_uri typically refers to a portion of a raster (e.g., COG tile) rather than a full array.
This also introduces strides/offset, which comes from pybuffer ( https://docs.python.org/3/c-api/buffer.html ) terminology. This allows some (but not all) subsets to be propagated without modifying DATA or OUTDB_URI. In particular, because OUTDB_URI doesn't have to be modified, we can delay loading tiles for longer.
I am a tiny bit worried that this won't be sufficient to avoid loading all of the things we want to avoid loading, but also we maybe don't have the full picture on what the subset specification looks like yet. Maybe this can be subset: Utf8 and we can sort out the exact subset specifications we'll allow and how to resolve them later (e.g., GDAL has its own way to specify this as well).
paleolimbot
left a comment
There was a problem hiding this comment.
I did another round...there are a few places where tests or comments were removed that were still useful.
I do think it's worth iterating a tiny bit on the schema, and importantly, I think we should wait until @Kontinuation has finished the GDAL read portion of the previous set of raster refactoring because (1) this was a lot of work and Kristin's time is valuable and (2) these are important use cases to consider when making the new type work!
| // Check error handling for zero determinant | ||
| let bad_raster = TestRaster { | ||
| metadata: RasterMetadata { | ||
| width: 10, | ||
| height: 20, | ||
| upperleft_x: 100.0, | ||
| upperleft_y: 200.0, | ||
| scale_x: 1.0, | ||
| scale_y: 0.0, | ||
| skew_x: 0.0, | ||
| skew_y: 0.0, | ||
| }, | ||
| }; | ||
| let raster = TestRaster::new(100.0, 200.0, 1.0, -2.0, 0.25, 0.5); |
There was a problem hiding this comment.
For what it's worth the named struct construction here is more readable than the revised TestRaster::new()
| assert!(result | ||
| .err() | ||
| .unwrap() | ||
| .to_string() | ||
| .contains("determinant is zero.")); |
There was a problem hiding this comment.
Just checking that this expectation was moved and not deleted
| #[cfg(test)] | ||
| mod tests { | ||
| use super::*; | ||
| use crate::array::RasterStructArray; | ||
| use sedona_testing::rasters::generate_test_rasters; | ||
|
|
||
| #[test] | ||
| fn display_non_skewed_raster() { |
There was a problem hiding this comment.
Are these tests moved to a different place or should they be re-added?
| /// TODO: For formats like Zarr that may need complex metadata (array path, chunk | ||
| /// coordinates, byte ranges), a simple key-value fragment may not be sufficient. | ||
| /// If this becomes a limitation, consider switching the fragment to a JSON object | ||
| /// or making the entire outdb_uri a JSON string for those formats. | ||
| #[derive(Debug, PartialEq)] | ||
| pub struct OutDbUri<'a> { | ||
| /// Loader scheme (e.g., "geotiff", "zarr") | ||
| pub scheme: &'a str, | ||
| /// External resource path (e.g., "s3://bucket/file.tif") | ||
| pub path: &'a str, | ||
| /// Loader-specific fragment (e.g., "band=1"), or None if absent | ||
| pub fragment: Option<&'a str>, | ||
| } |
There was a problem hiding this comment.
There's a sturcture for this already that we probably want to reuse (I think it's called the ObjectStoreUrl). It's based on the url crate I believe, which would also be an improvement over rolling our own here. I think this would give you the extra parameter information you'll need without having to maintain it here.
| /// Determine if this tile contains a corner of the overall grid and return its position | ||
| /// Returns Some(position) if this tile contains a corner, None otherwise |
There was a problem hiding this comment.
Is there a reason these comments were removed?
| fn test_generate_multi_band_raster() { | ||
| let struct_array = generate_multi_band_raster(); | ||
| let raster_array = RasterStructArray::new(&struct_array); | ||
| assert_eq!(raster_array.len(), 1); |
There was a problem hiding this comment.
Is this test no longer relevant?
Summary
Upgrades SedonaDB's raster type from 2D tiles to N-dimensional chunks with named dimensions, enabling support for climate model outputs, hyperspectral imagery, Zarr/NetCDF datacubes, and other multi-dimensional geospatial datasets.
Key changes
dim_names,shape,strides,offset; affine transform collapsed to singleList<Float64>; explicitx_dim/y_dimfields for spatial dimension declaration;outdb_urireplacesoutdb_url+outdb_band_id;StorageTypeenum removedRasterRef(transform, x_dim/y_dim, width/height derived from band dims) andBandRef(ndim, dim_names, shape, nd_buffer, contiguous_data returningCow<[u8]>)NdBufferstruct — exposes raw buffer + shape + strides + offset for zero-copy access (numpy/Arrow FFI boundary)start_raster_2d/start_band_2dfor legacy 2D usage; full N-Dstart_raster/start_bandfor multi-dimensional datadim_names=["y","x"],shape=[height, width]Crates modified
sedona-schema— N-D raster schema definitionsedona-raster— traits, Arrow array reader, builder, affine transform, displaysedona-testing— raster test helperssedona-raster-functions— all RS_* function implementationsTest plan
sedona-schema: 25 tests passsedona-raster: 24 tests pass (17 existing + 7 new N-D capability tests)sedona-raster-functions: 140 tests pass — all existing tests produce identical outputssedona-testing: 8 raster tests passcargo check --workspace)