Skip to content

feat: N-dimensional raster type extension (Phase 1)#749

Open
james-willis wants to merge 15 commits intoapache:mainfrom
james-willis:jw/nd-raster-type
Open

feat: N-dimensional raster type extension (Phase 1)#749
james-willis wants to merge 15 commits intoapache:mainfrom
james-willis:jw/nd-raster-type

Conversation

@james-willis
Copy link
Copy Markdown
Contributor

@james-willis james-willis commented Apr 1, 2026

Summary

Upgrades SedonaDB's raster type from 2D tiles to N-dimensional chunks with named dimensions, enabling support for climate model outputs, hyperspectral imagery, Zarr/NetCDF datacubes, and other multi-dimensional geospatial datasets.

Key changes

  • New Arrow schema — flat band struct with dim_names, shape, strides, offset; affine transform collapsed to single List<Float64>; explicit x_dim/y_dim fields for spatial dimension declaration; outdb_uri replaces outdb_url + outdb_band_id; StorageType enum removed
  • New traitsRasterRef (transform, x_dim/y_dim, width/height derived from band dims) and BandRef (ndim, dim_names, shape, nd_buffer, contiguous_data returning Cow<[u8]>)
  • NdBuffer struct — exposes raw buffer + shape + strides + offset for zero-copy access (numpy/Arrow FFI boundary)
  • Builder conveniencestart_raster_2d/start_band_2d for legacy 2D usage; full N-D start_raster/start_band for multi-dimensional data
  • All 33 RS_ functions migrated* — mechanical migration, all produce identical outputs
  • Backward compatible — legacy 2D rasters represented as bands with dim_names=["y","x"], shape=[height, width]

Crates modified

  • sedona-schema — N-D raster schema definition
  • sedona-raster — traits, Arrow array reader, builder, affine transform, display
  • sedona-testing — raster test helpers
  • sedona-raster-functions — all RS_* function implementations

Test plan

  • sedona-schema: 25 tests pass
  • sedona-raster: 24 tests pass (17 existing + 7 new N-D capability tests)
  • sedona-raster-functions: 140 tests pass — all existing tests produce identical outputs
  • sedona-testing: 8 raster tests pass
  • Full workspace compiles (cargo check --workspace)
  • New N-D tests cover: non-standard spatial dim names (lon/lat), mixed dimensionality bands, dim_index/dim_size lookups, Cow::Borrowed verification, NdBuffer strides for various types/shapes, no-bands edge case, nullable band names

Replace the legacy 2D raster schema with the N-dimensional layout:

- Remove metadata sub-struct (width/height/upperleft/scale/skew)
- Add transform: List<Float64> (6-element GDAL GeoTransform)
- Add x_dim/y_dim: Utf8 for explicit spatial dimension declaration
- Flatten band struct: name, dim_names, shape, data_type, nodata,
  strides, offset, outdb_uri, data
- Remove StorageType enum (OutDb indicated by non-null outdb_uri)
- Remove metadata_indices/band_metadata_indices modules
- Update raster_indices and band_indices for new layout

Downstream crates will not compile until subsequent commits
update them to use the new schema.
Replace the legacy trait hierarchy with N-D raster types:

- traits.rs: RasterRef (transform, x_dim, y_dim, width/height
  derived from band dims), BandRef (ndim, dim_names, shape,
  nd_buffer, contiguous_data returning Cow), NdBuffer struct
- array.rs: RasterStructArray reads new schema (transform list,
  x_dim/y_dim, flattened band fields with nested lists for
  dim_names/shape/strides), RasterRefImpl with band_boxed()
- builder.rs: start_raster/start_band with N-D params, plus
  start_raster_2d/start_band_2d convenience for legacy 2D usage
- affine_transformation.rs: AffineMatrix::from_transform(&[f64])
  replaces from_metadata(), free functions accept &dyn RasterRef
- display.rs: updated for new trait interface

Downstream crates will not compile until test utilities and
RS_* functions are updated in subsequent commits.
… test utils

Update sedona-raster core types and sedona-testing helpers for the
N-D raster schema:

sedona-raster:
- traits.rs: RasterRef with transform/x_dim/y_dim/width/height,
  BandRef with ndim/dim_names/shape/nd_buffer/contiguous_data(Cow),
  NdBuffer struct. band() returns Box<dyn BandRef>.
- array.rs: RasterStructArray reads new flattened schema with
  nested lists for dim_names/shape/strides
- builder.rs: start_raster/start_band with N-D params, plus
  start_raster_2d/start_band_2d convenience methods
- affine_transformation.rs: from_transform(&[f64]) replaces
  from_metadata(), free functions accept &dyn RasterRef
- display.rs: updated for new trait interface

sedona-testing:
- All raster helpers updated to use new builder API
- assert_raster_equal compares transforms, dims, shapes, data
- generate_multi_band_raster uses start_band_2d

Also fixes x_dim/y_dim schema to use Utf8View (matching builder).

sedona-raster-functions not yet updated — next commit.
Mechanically migrate all 14 RS_* function files from the legacy
raster API to the new N-D trait interface:

- raster.metadata().width/height → raster.width/height().unwrap()
- raster.metadata().upper_left_x/scale_x/etc → raster.transform()[i]
- raster.bands().len/band(n) → raster.num_bands/band(n-1)
- band.metadata().data_type/nodata/storage_type → band.data_type/nodata/outdb_uri
- band.data() → band.contiguous_data()
- AffineMatrix::from_metadata → from_transform
- Remove StorageType, RasterMetadata, BandMetadata imports
- Update all test helpers to use start_raster_2d/start_band_2d

All 140 existing tests pass with identical outputs.
Add 7 tests covering new N-dimensional raster capabilities:

- Non-standard spatial dim names (lon/lat): width()/height() work
- Mixed dimensionality: 3D + 2D bands in same raster
- dim_index()/dim_size() lookups including missing dims
- contiguous_data() returns Cow::Borrowed in Phase 1
- NdBuffer strides correct for UInt8/Float64/UInt16 at various shapes
- width()/height() returns None for raster with no bands
- Band name nullable: named vs unnamed bands, out-of-range
Move the u32-to-BandDataType conversion from inline match in the
array reader to BandDataType::try_from_u32() on the enum itself.
Eliminates duplicated mapping logic.
- All 10 valid discriminants map correctly
- Invalid values (0, 11, u32::MAX) return None
- Round-trip: discriminant as u32 → try_from_u32 for all variants
- Fix needless borrows flagged by clippy
- Suppress too_many_arguments on start_raster_2d (intentional API)
- cargo fmt across all modified crates
Add outdb_uri parameter to RasterBuilder::start_band() so OutDb
bands can be constructed. Restore the RS_BandPath outdb tests
that were weakened during the migration — they now properly test
the Some(uri) code path again.
Update test outdb_uri values to follow the design convention:
geotiff://s3://bucket/file.tif#band=N
- Replace unwrap() on width()/height() with proper error handling
  in rs_convexhull, rs_envelope, rs_size, rs_spatial_predicates
- Remove dead band_name_array field from BandRefImpl
- Add debug_assert! bounds checks on AffineMatrix::from_transform
  and RasterRefImpl::transform()
- Add finish_band() validation: exactly one data value per band
- Add start_band_2d() validation: reject when width/height are 0
- Add band_by_name() default method on RasterRef for Zarr workflows
Add parse_outdb_uri() utility that splits scheme://path#fragment
into components. RS_BandPath now strips the internal scheme prefix
and fragment, returning just the path portion to users — matching
Sedona Spark's behavior where RS_BandPath returns a plain file path.

Test outdb_uri values now use scheme-dispatched format
(geotiff://s3://bucket/file.tif#band=1) while RS_BandPath output
remains s3://bucket/file.tif.
@james-willis james-willis marked this pull request as ready for review April 3, 2026 18:06
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a closer look at the implementation tomorrow, but wanted to get a preliminary review of the schema out!

Comment on lines -27 to 46
Field::new(column::METADATA, Self::metadata_type(), false),
Field::new(column::CRS, Self::crs_type(), true), // Optional: may be inferred from data
Field::new(column::CRS, Self::crs_type(), true),
Field::new(column::TRANSFORM, Self::transform_type(), false),
Field::new(column::X_DIM, DataType::Utf8View, false),
Field::new(column::Y_DIM, DataType::Utf8View, false),
Field::new(column::BANDS, Self::bands_type(), true),
])
}

/// Raster metadata schema
pub fn metadata_type() -> DataType {
DataType::Struct(Fields::from(vec![
// Raster dimensions
Field::new(column::WIDTH, DataType::UInt64, false),
Field::new(column::HEIGHT, DataType::UInt64, false),
// Geospatial transformation parameters
Field::new(column::UPPERLEFT_X, DataType::Float64, false),
Field::new(column::UPPERLEFT_Y, DataType::Float64, false),
Field::new(column::SCALE_X, DataType::Float64, false),
Field::new(column::SCALE_Y, DataType::Float64, false),
Field::new(column::SKEW_X, DataType::Float64, false),
Field::new(column::SKEW_Y, DataType::Float64, false),
]))
/// Affine transform schema — 6-element GDAL GeoTransform:
/// `[origin_x, scale_x, skew_x, origin_y, skew_y, scale_y]`
pub fn transform_type() -> DataType {
DataType::List(FieldRef::new(Field::new("item", DataType::Float64, false)))
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just highlighting for myself + others that this is the main change.

The main conceptual change here seems to be that the height and width in 2D space are now members of the bands. Even though it might involve some duplication, I wonder if a top-level shape (whose ordering is always x, y, and possibly z in some future) would still be appropriate. The bands would then all be parsed according to the x/y dim name and validated against the top-level declaration.

X_DIM + Y_DIM should probably also be a list to allow for future Z. Lists are kind of a pain in Rust but interacting with these in Python is fairly easy to do and this PR is probably the only one that needs to consider that detail.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in zarr, each variable (here band) can have any permutation of dimensions. so band 1 might be xyzt, 2 might be yzt, 3 might be xy

Comment on lines +57 to +68
/// Individual band schema — flattened N-D band with dimension metadata
pub fn band_type() -> DataType {
DataType::Struct(Fields::from(vec![
Field::new(column::METADATA, Self::band_metadata_type(), false),
Field::new(column::DATA, Self::band_data_type(), false),
Field::new(column::NAME, DataType::Utf8, true),
Field::new(column::DIM_NAMES, Self::dim_names_type(), false),
Field::new(column::SHAPE, Self::shape_type(), false),
Field::new(column::DATATYPE, DataType::UInt32, false),
Field::new(column::NODATA, DataType::Binary, true),
Field::new(column::STRIDES, Self::strides_type(), false),
Field::new(column::OFFSET, DataType::UInt64, false),
Field::new(column::OUTDB_URI, DataType::Utf8, true),
Field::new(column::DATA, DataType::BinaryView, false),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highlighting changes here to check my understanding...the main change is that bands have their own dimensions, with the addition of dim_names. The spatial dimensions of all the bands have to match. Here outdb_uri typically refers to a portion of a raster (e.g., COG tile) rather than a full array.

This also introduces strides/offset, which comes from pybuffer ( https://docs.python.org/3/c-api/buffer.html ) terminology. This allows some (but not all) subsets to be propagated without modifying DATA or OUTDB_URI. In particular, because OUTDB_URI doesn't have to be modified, we can delay loading tiles for longer.

I am a tiny bit worried that this won't be sufficient to avoid loading all of the things we want to avoid loading, but also we maybe don't have the full picture on what the subset specification looks like yet. Maybe this can be subset: Utf8 and we can sort out the exact subset specifications we'll allow and how to resolve them later (e.g., GDAL has its own way to specify this as well).

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another round...there are a few places where tests or comments were removed that were still useful.

I do think it's worth iterating a tiny bit on the schema, and importantly, I think we should wait until @Kontinuation has finished the GDAL read portion of the previous set of raster refactoring because (1) this was a lot of work and Kristin's time is valuable and (2) these are important use cases to consider when making the new type work!

Comment on lines -247 to +233
// Check error handling for zero determinant
let bad_raster = TestRaster {
metadata: RasterMetadata {
width: 10,
height: 20,
upperleft_x: 100.0,
upperleft_y: 200.0,
scale_x: 1.0,
scale_y: 0.0,
skew_x: 0.0,
skew_y: 0.0,
},
};
let raster = TestRaster::new(100.0, 200.0, 1.0, -2.0, 0.25, 0.5);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth the named struct construction here is more readable than the revised TestRaster::new()

Comment on lines -327 to -331
assert!(result
.err()
.unwrap()
.to_string()
.contains("determinant is zero."));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking that this expectation was moved and not deleted

Comment on lines -117 to -124
#[cfg(test)]
mod tests {
use super::*;
use crate::array::RasterStructArray;
use sedona_testing::rasters::generate_test_rasters;

#[test]
fn display_non_skewed_raster() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these tests moved to a different place or should they be re-added?

Comment on lines +29 to +41
/// TODO: For formats like Zarr that may need complex metadata (array path, chunk
/// coordinates, byte ranges), a simple key-value fragment may not be sufficient.
/// If this becomes a limitation, consider switching the fragment to a JSON object
/// or making the entire outdb_uri a JSON string for those formats.
#[derive(Debug, PartialEq)]
pub struct OutDbUri<'a> {
/// Loader scheme (e.g., "geotiff", "zarr")
pub scheme: &'a str,
/// External resource path (e.g., "s3://bucket/file.tif")
pub path: &'a str,
/// Loader-specific fragment (e.g., "band=1"), or None if absent
pub fragment: Option<&'a str>,
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a sturcture for this already that we probably want to reuse (I think it's called the ObjectStoreUrl). It's based on the url crate I believe, which would also be an improvement over rolling our own here. I think this would give you the extra parameter information you'll need without having to maintain it here.

Comment on lines -295 to -296
/// Determine if this tile contains a corner of the overall grid and return its position
/// Returns Some(position) if this tile contains a corner, None otherwise
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason these comments were removed?

Comment on lines -601 to -604
fn test_generate_multi_band_raster() {
let struct_array = generate_multi_band_raster();
let raster_array = RasterStructArray::new(&struct_array);
assert_eq!(raster_array.len(), 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test no longer relevant?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants