
Conversation

@amstokely
Contributor

amstokely commented Dec 12, 2025

This PR increases the default Parallel NetCDF (PnetCDF) header size to 128 KB to reduce the likelihood of header reallocation during MPAS I/O. In certain situations, when MPAS overwrites existing string attributes or variables with larger values, the NetCDF header can grow beyond its preallocated padding, forcing PnetCDF to reallocate the header during ncmpi_enddef, which can lead to an I/O hang. This behavior was identified as the root cause of the hang reported in MPAS-Workflow issue #384. Increasing the default header size makes it less likely that such overwrites require reallocation. Preliminary testing of the calculation that previously triggered the hang indicates that this change resolves the issue without affecting calculation results or I/O performance.

Fixes #1385

This reduces the likelihood of header reallocation as metadata grows.
Header reallocation is extremely expensive and can significantly degrade
parallel I/O performance.
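
A minimal sketch of the hint-based approach (the hint name nc_header_align_size, the creation flag, and the variable names here are assumptions for illustration, not the exact diff in this PR):

MPI_Info info;
MPI_Info_create(&info);
/* pad the header extent to a multiple of 128 KB (hint name assumed) */
MPI_Info_set(info, "nc_header_align_size", "131072");
ierr = ncmpi_create(comm, filename, NC_64BIT_DATA, info, &ncid);
MPI_Info_free(&info);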
amstokely changed the base branch from master to develop December 12, 2025 21:49
amstokely marked this pull request as ready for review December 12, 2025 21:51
@wkliao

wkliao commented Dec 15, 2025

(Sorry to jump into the discussion.)
There is another way to do this without creating an MPI_Info object, which is to call ncmpi__enddef(), i.e.

ierr = ncmpi__enddef(file->ncidp, 0, 131072, 0, 0);

You can also set argument h_minfree, which ensures at least this amount of free space in the header section.
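
For example, with h_minfree as the first MPI_Offset argument (the order is ncid, h_minfree, v_align, v_minfree, r_align), reserving at least 128 KB of free header space would look like:

ierr = ncmpi__enddef(file->ncidp, 131072, 0, 0, 0);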

FYI. When adding new data objects to an existing file causes the file header section to grow, PnetCDF must move the data section to a higher file offset, which can be expensive, especially when the existing file is large. The application most likely just ran very slowly, rather than hanging.

@mgduda
Contributor

mgduda commented Dec 15, 2025

@wkliao Thanks very much for the suggestion to consider ncmpi__enddef as an alternative to passing a hint to ncmpi_create. I do like that this approach would require fewer lines of new code.

If we were to open an existing file for writing and call ncmpi__enddef (even if we didn't define any new dimensions or variables), would a header reallocation always be forced if the original header was smaller than the v_align argument to ncmpi__enddef?

@amstokely
Contributor Author

> (Sorry to jump into the discussion.) There is another way to do this without creating an MPI_Info object, which is to call ncmpi__enddef(), i.e.
>
> ierr = ncmpi__enddef(file->ncidp, 0, 131072, 0, 0);
>
> You can also set argument h_minfree, which ensures at least this amount of free space in the header section.
>
> FYI. When adding new data objects to an existing file causes the file header section to grow, PnetCDF must move the data section to a higher file offset, which can be expensive, especially when the existing file is large. The application most likely just ran very slowly, rather than hanging.

@wkliao This issue only surfaced when running MPAS through MPAS-JEDI under very specific conditions, which made it difficult to track down. Is there a check you’re aware of that we could have put in place to make this easier to catch?

@wkliao

wkliao commented Dec 15, 2025

Let me use two PnetCDF terms to help explain:

  • "header extent" refers to the size of the file header section, and
  • "header size" is the true size of the header (i.e., the metadata it actually contains).

Subtracting the two gives the free space available in the header section.

> If we were to open an existing file for writing and call ncmpi__enddef (even if we didn't define any new dimensions or variables), would a header reallocation always be forced if the original header was smaller than the v_align argument to ncmpi__enddef?

In this case, PnetCDF will check whether the file extent of the existing file is aligned with v_align. If it is not, PnetCDF will move the data section to a higher file offset that is aligned with v_align.

Please note that v_align is not the size of the file extent. It is used to align the header extent to a multiple of v_align. For example, if the file header size is 2000 bytes and v_align is 1024, then the file extent will become 2048.
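
In other words, the header extent is the header size rounded up to the next multiple of v_align:

extent = ceil(header_size / v_align) * v_align
       = ceil(2000 / 1024) * 1024 = 2048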

@wkliao

wkliao commented Dec 15, 2025

> Is there a check you’re aware of that we could have put in place to make this easier to catch?

Two PnetCDF APIs can be used to query the header size and header extent of an existing file: ncmpi_inq_header_size() and ncmpi_inq_header_extent(). They can also be called after ncmpi_enddef().

If the free space is sufficiently large, then just call ncmpi_enddef().
Otherwise, moving the data section is inevitable.
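
A minimal sketch of such a check, with ncid and the projected metadata growth bytes_to_add as placeholders:

MPI_Offset h_size, h_extent;
ncmpi_inq_header_size(ncid, &h_size);     /* true size of the metadata */
ncmpi_inq_header_extent(ncid, &h_extent); /* space reserved for the header */
if (h_extent - h_size >= bytes_to_add)    /* enough free space in the header? */
    ierr = ncmpi_enddef(ncid);            /* no data-section move required */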

amstokely requested a review from mgduda December 19, 2025 17:21
int io_group;
MPI_Comm io_file_comm;
MPI_Comm io_group_comm;
MPI_Info info = MPI_INFO_NULL;
Contributor

I'm not sure that there's any benefit to declaring info in this scope and using it in all calls to ncmpi_open. I'd suggest again that we declare this variable in the block beginning on line 323 (325 in this PR) and apply info only when creating new files.

Contributor Author

My reasoning here was to ensure that the info variable is always in a valid state (which I think is important when possible) and to improve code consistency, so that all calls to ncmpi_open use the same MPI_Info variable. This also makes it easier to pass a non-null MPI_Info object to other ncmpi_open calls in the future, if needed. That said, I don’t feel strongly about this change and am happy to revert it if you prefer the previous approach to calling ncmpi_open.

Contributor

Since our intent is only to pass a hint to ncmpi_create, and not to change the semantics of the calls to ncmpi_open for writing an existing file or reading a file, I think it would be much cleaner and clearer if we only use info within the scope of the block containing the call to ncmpi_create.

If in future we decide to pass a non-null info to either of the ncmpi_open calls, we can decide at that time how to handle the MPI_Info instance.
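
A sketch of what that scoping might look like (the branch condition, communicator, file name, and creation flag are placeholders, not the actual MPAS code):

if (create_new_file) {
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "nc_header_align_size", "131072");
    ierr = ncmpi_create(comm, fname, NC_64BIT_DATA, info, &file->ncidp);
    MPI_Info_free(&info);   /* info never escapes the create path */
}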
