Skip to content

Comments

Apply filters to variable-length data#6178

Merged
lrknox merged 13 commits intoHDFGroup:developfrom
crusaderky:ignore_filters
Feb 24, 2026
Merged

Apply filters to variable-length data#6178
lrknox merged 13 commits intoHDFGroup:developfrom
crusaderky:ignore_filters

Conversation

@crusaderky
Copy link
Contributor

@crusaderky crusaderky commented Jan 28, 2026

  • Closes can_apply and set_local filter callbacks skipped for H5T_VARIABLE dtypes #5942. A more detailed explanation of the rationale for this PR is in the linked issue.
  • Mandatory filters can be now applied to variable-length data. This notably includes h5py object strings and NpyStrings.
  • Optional filters applied to variable-length data no longer skip can_apply and set_local, only to jump directly to filter. This fixes hdf5_blosc and HDF5_Blosc2, which rely on set_local. Note that all filters which have neither can_apply nor set_local have always worked fine with vlen data, as long as they were tagged as optional. Notable examples are deflate, lzf, bzip2 (from hdf5plugin), and lz4 (also from hdf5plugin).
  • Lossy filters, such as scale-offset, are and remain untested and most likely broken, as I expect them to treat the vlen metadata as numbers and corrupt it. The right place to ensure proper behaviour is their can_apply function, which however is broken (filters: can_apply does nothing for optional filters #6161).

Important

Enhance filter application to variable-length data, fixing optional filter handling and adding tests.

  • Behavior:
    • Mandatory filters can now be applied to variable-length data, including h5py strings.
    • Optional filters no longer skip can_apply and set_local steps, fixing issues with hdf5_blosc and HDF5_Blosc2.
    • Lossy filters remain untested and potentially broken for vlen data.
  • Code Changes:
    • Modify H5Z_ignore_filters() in H5Z.c to handle optional filters correctly.
    • Add tests in dsets.c for scalar and null datasets with optional filters, and for set_local updates with vlen data.
  • Tests:
    • Add test_optional_filters_scalar, test_optional_filters_null, test_set_local_updates_cd, test_set_local_updates_cd_vlen, and test_deflate_vlen in dsets.c.

This description was created by Ellipsis for 70ab78a. You can customize this summary. It will automatically update as commits are pushed.

} /* end test_optional_filters_scalar() */

/*-------------------------------------------------------------------------
* Function: test_optional_filters_null
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was previously untested.

} /* end test_set_local_updates_cd() */

/*-------------------------------------------------------------------------
* Function: test_set_local_updates_cd_vlen
Copy link
Contributor Author

@crusaderky crusaderky Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails in the main branch, complaining that filters can't be applied to vlen data.
If I replace H5Z_FLAG_MANDATORY with H5Z_FLAG_OPTIONAL below, it crashes on assert(0) inside the filter callback, highlighting that set_local has not been called.

} /* end test_set_local_updates_cd_vlen() */

/*-------------------------------------------------------------------------
* Function: test_deflate_vlen
Copy link
Contributor Author

@crusaderky crusaderky Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests what happens when filter actually processes the contiguous raw chunk data.

This test fails in the main branch, complaining that filters can't be applied to vlen data.
However, if I replace H5Z_FLAG_MANDATORY with H5Z_FLAG_OPTIONAL below, it passes in the main branch, thanks to the fact that the deflate filter has no set_local callback.

Running deflate on more substantial data shows that the output file size is much smaller than the uncompressed version.

crusaderky added a commit to crusaderky/hdf5-pixi that referenced this pull request Jan 28, 2026
crusaderky added a commit to crusaderky/hdf5-blosc that referenced this pull request Jan 28, 2026
crusaderky added a commit to crusaderky/hdf5-blosc that referenced this pull request Jan 28, 2026
@crusaderky
Copy link
Contributor Author

Downstream tests in hdf5-blosc: Blosc/hdf5-blosc#38


/* Create dataset with optional filter */
if ((dsid = H5Dcreate2(file, DSET_OPTIONAL_SCALAR, strtid, sid, H5P_DEFAULT, dcplid, H5P_DEFAULT)) < 0)
/* Write data to the dataset */
Copy link
Contributor Author

@crusaderky crusaderky Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks that the filter callback is never invoked, which the previous version of this test left untested.

@hyoklee hyoklee added the Component - C Library Core C library issues (usually in the src directory) label Jan 28, 2026
@crusaderky
Copy link
Contributor Author

crusaderky commented Jan 29, 2026

@hyoklee is it possible to avoid having CI re-authorised by a maintainer every time I need to fix a trivial error? This is hurting velocity quite badly.

crusaderky added a commit to crusaderky/hdf5-pixi that referenced this pull request Jan 29, 2026
test/dsets.c Outdated
/* Create the datatype */
if ((strtid = H5Tcreate(H5T_STRING, H5T_VARIABLE)) < 0)
/* The filter is optional. Callbacks will crash if ever invoked. */
if (H5Pset_filter(dcplid, H5Z_FILTER_CRASH, H5Z_FLAG_OPTIONAL, 0, NULL) < 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this filter registered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't. I think the test worked because hdf5 skips the whole filter stack due to the data type of the dataset. This highlights that H5Pset_filter is lacking input validation (out of scope for this PR).

Copy link
Member

@fortnern fortnern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Library changes look good, some comments about the tests

@fortnern
Copy link
Member

I've confirmed that disabling can_apply and set_local for vlens was a mistake

@fortnern
Copy link
Member

It makes sense you might want to compress a compound containing variable length data as well as fixed length data.

Absolutely! I would however leave it as a follow-up.

This should actually work fine with just this change. We can write a test for it at some point if it's not already tested (it's possible the test passes because it uses a filter that doesn't use can_apply or set_local).

@vchoi-hdfgroup vchoi-hdfgroup modified the milestones: HDF5 2.1.0, HDF5 2.2.0 Feb 20, 2026
@crusaderky
Copy link
Contributor Author

It makes sense you might want to compress a compound containing variable length data as well as fixed length data.

Absolutely! I would however leave it as a follow-up.

This should actually work fine with just this change. We can write a test for it at some point if it's not already tested (it's possible the test passes because it uses a filter that doesn't use can_apply or set_local).

My demo above does not change its output after this PR, so the bit that feeds the actual data through the filter is missing.

@crusaderky
Copy link
Contributor Author

@fortnern all comments have been addressed. This is ready for another round of review.

fortnern
fortnern previously approved these changes Feb 23, 2026
@crusaderky
Copy link
Contributor Author

@fortnern I applied the nit you suggested but it removed your approval

src/H5Z.c Outdated
*/
htri_t
H5Z_ignore_filters(hid_t dcpl_id, const H5T_t *type, const H5S_t *space)
H5Z_ignore_filters(hid_t dcpl_id, void H5_ATTR_UNUSED *type, const H5S_t *space)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
H5Z_ignore_filters(hid_t dcpl_id, void H5_ATTR_UNUSED *type, const H5S_t *space)
H5Z_ignore_filters(hid_t dcpl_id, const H5S_t *space)

You will also need to change the prototype declaration in H5Zprivate.h

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the invocation in H5D__create()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and pushed the changes myself, I hope that's ok. We're trying to get this in the release.

@nbagha1 nbagha1 modified the milestones: HDF5 2.2.0, HDF5 2.1.0 Feb 24, 2026
@lrknox lrknox merged commit b074586 into HDFGroup:develop Feb 24, 2026
238 of 239 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in HDF5 - TRIAGE & TRACK Feb 24, 2026
fortnern added a commit to fortnern/hdf5 that referenced this pull request Feb 25, 2026
fortnern added a commit that referenced this pull request Feb 25, 2026
lrknox pushed a commit to lrknox/hdf5 that referenced this pull request Feb 25, 2026
Closes can_apply and set_local filter callbacks skipped for H5T_VARIABLE dtypes HDFGroup#5942. A more detailed explanation of the rationale for this PR is in the linked issue.

Mandatory filters can be now applied to variable-length data. This notably includes h5py object strings and NpyStrings.

Optional filters applied to variable-length data no longer skip can_apply and set_local, only to jump directly to filter. This fixes hdf5_blosc and HDF5_Blosc2, which rely on set_local. Note that all filters which have neither can_apply nor set_local have always worked fine with vlen data, as long as they were tagged as optional. Notable examples are deflate, lzf, bzip2 (from hdf5plugin), and lz4 (also from hdf5plugin).

Lossy filters, such as scale-offset, are and remain untested and most likely broken, as I expect them to treat the vlen metadata as numbers and corrupt it. The right place to ensure proper behaviour is their can_apply function, which however is broken (filters: can_apply does nothing for optional filters HDFGroup#6161).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component - C Library Core C library issues (usually in the src directory)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

can_apply and set_local filter callbacks skipped for H5T_VARIABLE dtypes

7 participants