Skip to content

Feat/cmip7 awiesm3 veg hr#266

Open
JanStreffing wants to merge 303 commits into
prep-releasefrom
feat/cmip7-awiesm3-veg-hr
Open

Feat/cmip7 awiesm3 veg hr#266
JanStreffing wants to merge 303 commits into
prep-releasefrom
feat/cmip7-awiesm3-veg-hr

Conversation

@JanStreffing
Copy link
Copy Markdown
Contributor

@JanStreffing JanStreffing commented Apr 2, 2026

CMIP7 cmorization for AWI-ESM3-VEG-HR

Adds full CMIP7 support targeting AWI-ESM3-VEG-HR, including a native compound-name
architecture that replaces the legacy cmip6-table-based data request lookup.

Key changes

CMIP7 data request

  • Load DataRequest from CMIP7_DReq_metadata JSON instead of cmip6 tables
  • Refactor to native compound-name architecture (ocean.tos.tavg-u-hxy-sea.mon.GLB)
  • Fix JSON key mismatch: cmip6_tablecmip6_cmor_table in vendored metadata
  • Improve compound_name matching against cmip6_compound_name and cmip7_compound_name attributes
  • Derive table_id from compound name when not set explicitly
  • Strict ValueError on zero DRV matches (instead of silent skip)

Pipeline

  • Add generic vertical_integrate custom pipeline step
  • Remove duplicate convert() step from DefaultPipeline
  • Fix Prefect State objects not being unwrapped to actual results in parallel runs
  • Propagate pipeline/flow errors instead of silently logging them

Standard library

  • Add time bounds support (src/pycmor/std_lib/time_bounds.py)
  • Fix dimension mapping to use getattr + _pycmor_cfg fallback
  • Fix global_attributes to derive table_id from CMIP6/CMIP7 compound names

Xarray accessor API

  • Lazy accessor registration and StdLibAccessor with .process()

Test infrastructure

  • Modernize with entry-point model discovery (pycmor.fixtures.model_runs)
  • Add pycmor.tutorial dataset system (xarray.tutorial-style API)
  • Fix stub generator to use monotonic coordinate values for multi-file datasets

Misc fixes

  • Python 3.9 entry_points() compatibility
  • Guard pyfesom2 imports for environments without it
  • Fix tarball double-nesting extraction on Python 3.12+
  • Rename non-standard time dimension on load (OpenIFS support)

Test plan

  • Unit tests pass: pytest tests/unit/
  • pycmor process examples/awiesm3-cmip7-minimal.yaml runs successfully on Levante
  • core_atm runs with tco95_core2 on Levante
  • core_land runs with tco95_core2 on Levante
  • core_ocean runs with tco95_core2 on Levante
  • core_seaice runs with tco95_core2 on Levante
  • lrcs_land runs with tco95_core2 on Levante
  • lrcs_oce runs with tco95_core2 on Levante
  • lrcs_seaice runs with tco95_core2 on Levante
  • cap7_atm runs with tco95_core2 on Levante
  • cap7_ocean runs with tco95_core2 on Levante
  • cap7_seaice runs with tco95_core2 on Levante
  • cap7_land runs with tco95_core2 on Levante
  • cap7_aerosol runs with tco95_core2 on Levante
  • veg_atm runs with tco95_core2 on Levante
  • veg_land runs with tco95_core2 on Levante
  • veg_seaice runs with tco95_core2 on Levante
  • extra_atm runs with tco95_core2 on Levante
  • extra_land runs with tco95_core2 on Levante
  • final check of all rules.

@JanStreffing JanStreffing changed the base branch from main to prep-release April 2, 2026 12:55
@JanStreffing JanStreffing force-pushed the feat/cmip7-awiesm3-veg-hr branch from 1a42875 to 5617a18 Compare April 2, 2026 13:18
@JanStreffing
Copy link
Copy Markdown
Contributor Author

I was able to run both tos and the more complex absscint with the lastest commit on this branch. We may want to work on #267, and certainly need to work on #265. But neither should block us from starting to build up more rules for the picontrol variables.

@JanStreffing
Copy link
Copy Markdown
Contributor Author

JanStreffing commented Apr 5, 2026

The single failing test (test_library_process[fesom_dev-native-dask-cmip6]) is unrelated to this PR. It failed due to a network issue — the test tried to download pi_uxarray.tar from nextcloud.awi.de but got ConnectionError: [Errno 111] Connection refused. The AWI Nextcloud server was unreachable from the GitHub Actions runner.

Something to worry about? @pgierz

@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 6, 2026
@mandresm
Copy link
Copy Markdown
Contributor

mandresm commented Apr 7, 2026

@pgierz, will you review this or should I go for it?

@pgierz
Copy link
Copy Markdown
Member

pgierz commented Apr 7, 2026

I will look

@esm-tools esm-tools deleted a comment from github-actions Bot Apr 8, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 8, 2026
@esm-tools esm-tools deleted a comment from github-actions Bot Apr 8, 2026
nwieters and others added 6 commits April 9, 2026 15:04
calculate_chunks_simple() could produce chunk sizes larger than the
actual dimension when the dataset was very small (e.g. fx variables
with only 47 or 57 elements). This caused netCDF4 createVariable to
raise "chunksize cannot exceed dimension size". Add min() clamp.
When xarray reads NetCDF files, it stores 'coordinates' in the variable
encoding dict. set_coordinate_attributes then sets it in attrs too,
causing xarray to raise "'coordinates' found in both attrs and encoding"
on save. Pop from encoding before setting in attrs.
CMIP7 salinity units are "1E-03" (dimensionless with scaling factor).
pint-xarray cannot convert to such targets, raising "Unit expression
cannot have a scaling factor". When both source (e.g. psu) and target
are dimensionless, relabel without numeric conversion since the values
are already in the correct range.
All ~33 custom step functions hardcoded the CMOR variable name
(e.g. result.name = "zostoga") instead of using rule.model_variable.
This caused KeyError in set_variable when model_variable differs from
the CMOR name (e.g. model_variable="temp" but name was "zostoga").

Also fix compute_mass_transport to handle W-level data (nz=48
interfaces) by averaging to cell centers (nz=47) before multiplying
by layer thickness.
cli3X veg_land OSError root cause: every load_lpjguess_* function did
``for _, row in df_all.iterrows():`` to scatter rows into a pre-allocated
(time, ncells) numpy array. At HR LPJ-GUESS resolution that's millions
of Python iterations holding the GIL for several minutes per rule.

A dask worker thread that holds the GIL can't service its own
LocalCluster scheduler's heartbeat. After
distributed.comm.timeouts.connect (30s default) the scheduler marks
the worker dead and disconnects; the worker comes back when iterrows
finishes but the cluster has already cascaded into:

    OSError: Timed out trying to connect to tcp://127.0.0.1:<port> after 30 s

— visible in every cli3X veg_land shard log, killing all 3 shards.
The whole-rule retry decorator (8815440) catches the OSError but the
LocalCluster is too damaged to recover within the retry window.

Vectorized pattern (same logic, no Python row loop):

    coords_df_with_idx = coords_df.copy()
    coords_df_with_idx["_cell_idx"] = np.arange(len(coords_df_with_idx))
    df_merged = df_all.merge(
        coords_df_with_idx[["Lon", "Lat", "_cell_idx"]],
        on=["Lon", "Lat"], how="left",
    )
    valid = ~np.isnan(df_merged["_cell_idx"].values)
    cell_idx_int = df_merged["_cell_idx"].values[valid].astype(np.int64)
    yr_idx = np.searchsorted(years, df_merged["Year"].values[valid])
    values[yr_idx, cell_idx_int] = df_merged[var].values[valid]

Pandas merge runs in C and releases the GIL during the join. Load
should drop from "minutes blocking GIL" to "seconds".

Applied to all six loaders that used the iterrows pattern:
  load_lpjguess_monthly        (yearly + monthly format branches)
  load_lpjguess_yearly
  load_lpjguess_yearly_lut
  load_lpjguess_monthly_lut
  load_lpjguess_monthly_depth  (3D: time × sdepth × ncells)
  load_lpjguess_monthly_pool   (3D: time × soilCpool × ncells)

The 3D loaders still iterate the small depth/pool dim (3-6 layers) in
Python — that's bounded and fast. The hot loop (Python over all rows)
is gone in all six.
extra_atm sacct history (May 1-15): every reliable completion used
512G memory.

  Whole-tier era (pre-shard):
    24733123 mem512   01:51:28
    24733545          01:11:17
    24743470 cli3     02:07:45
    24748551 cli5     02:34:58
    24784803 cli9     01:22:15
    24813316 cli17    01:10:25

  Sharded --mem=0 era (default ~256G):
    cli28-cli36: OOM × 6, TIMEOUT × 1, two completions at non-default
    241016M (~235G).

6 completions on 512G + 0 reliable completions on 256G — proven config.
cli36 forensics confirmed the failure mode is not fix3-wedge or
allocator (jemalloc on, fix3=off, 17/20 rules saved): MaxRSS=167G but
MaxVMSize=458G shows malloc fragmentation pushing past the cgroup,
plus heavy concurrent hourly OIFS + 3D rules ('cl', 'pfull' both at
55min when 3 new rules started) tipping it over.

Adds 1 task to the scarcer ~282-node 512G pool; total 512G footprint
becomes 5/35 tasks (14% of work). Pool pressure remains minor.

Alternative considered: SHARD_SIZE=10 to halve concurrent-heavy count.
Architecturally cleaner but unproven for this tier — chose the path
with 6 historical successes.
…/_infer_frequency_multifile

Three Python-loop hotspots replaced with pandas vectorized paths:

1. ``select_range`` and ``validate_range``: ``df["start"].apply(pd.Timestamp)``
   → ``pd.to_datetime(df["start"])``. Same conversion, but C-level under
   the hood. These run on every Filecache load that filters by time range.

2. ``_infer_frequency_multifile``: ``for _, row in df.iterrows()`` building
   the timestamp list → ``pd.to_datetime(..., errors="coerce")`` + boolean
   masks. The try/except per-row pattern becomes a single ``valid`` mask;
   the steps-1/2/>2 branching reduces to a ``steps == 2`` mask appending
   end timestamps in addition to starts.

Net: no behaviour change, no memory change (same Series being materialized,
just via vectorized parse path). All 17 filecache unit tests pass.

Same diagnostic motivation as the lpjg vectorization (7900daa): Python
iteration over pandas/numpy in dask-managed code holds the GIL longer
than necessary and starves the LocalCluster heartbeat. filecache loads
happen at flow startup, so the impact here is on cold-start time, not
mid-rule deadlocks — but the antipattern is the same.

Not touched: ``regrid_regular_to_fesom`` in examples/custom_steps.py
(also has a per-timestep loop) — not referenced by any production cli3X
rule yaml, only an old test config. Dead code for the production path.
lrcs_seaice was given 6h walltime in 0c4aff1 because cli30
lrcs_seaice_3 TIMEOUTed at 3h. That was before jemalloc (cli34/5/6/7
made the allocator opt-in then global). The previous slow path was
driven by glibc fragmentation, not real compute time.

Empirical wall times since jemalloc landed:
  cli35 lrcs_seaice _1/2/3/4: 3:17, 1:08:52, 1:11:16, 46:13
  cli36 lrcs_seaice _1/2/3/4: 3:07, 1:12:04, 1:10:59, 56:15
  slowest single shard: 1:12:04 — well under 3h

3h gives 1h48m headroom on the worst observed run, which is enough.
All tiers now use the global WALLTIME default (3h) — uniform, simpler.
New env knob ``SHARD_DRS`` (default ``off``) toggles pycmor's
``enable_output_subdirs`` config option. When on:

  * The injected per-shard yaml sets
    ``pycmor.enable_output_subdirs = True``, so files.py's filename
    builder appends the DRS sub-tree returned by GlobalAttributes
    .subdir_path():

        <mip_era>/<activity_id>/<institution_id>/<source_id>/
            <experiment_id>/<member_id>/<table_id>/<variable_id>/
            <grid_label>/v<YYYYMMDD>/

    e.g. ``CMIP7/CMIP/AWI/AWI-ESM-3/picontrol/r1i1p1f1/Amon/tas/gn/
    v20260515/tas_<...>.nc``

  * The submitter collapses the per-tier ``<tier>/cmorized/`` OUTSUB
    to ``.`` so all 17 tiers land in one shared DRS root. The non-DRS
    nesting was a pycmor-specific organizational layer that downstream
    consumers don't expect at the start of the path.

Off by default — keeps the legacy per-tier flat layout for any current
consumers. Turn on via ``SHARD_DRS=on bash submit_hr_year_shards.sh ...``
for production / publication-ready output.
…zed layout

build_html_report: render_file_card was showing variable-group bounds
(whichever cadence happened to merge first) for every per-file card.
21 variables with multiple cadence rows displayed wrong "Expected"
columns and range SVG. Prefer rec.expected_* over ent.expected_*.

sanity_check: walker glob() missed files under <root>/<tier>/cmorized/
(produced when the runner emits a DRS-shaped subtree). Switch to
rglob() so the walker handles both flat and nested layouts.

run_walker_compute.sh: new sbatch wrapper to walk on compute nodes;
login-node OpenBLAS pthread limit + Lustre contention crashes the
worker pool on the big 8-19 GB 1hr/3D-model-level files.

run_perfile_maps_compute.sh: paths bumped to cli37 run.
build_maps: collapse all non-spatial dims (time + level + tile) by the
matching reducer per panel (min/min, mean/mean, max/max) so 3D fields
get full spatial coverage instead of mostly-NaN deep-level slices.
Other fixes: rglob() for nested cmorized/ layout, sibling-file lat/lon
recovery for hxy-si nod2 files that strip coords, dask-chunked open so
84-GB cl_day/pfull_day render without OOM, 2-98 percentile colorbar
clipping, prefer lat/lon over latitude/longitude for spatial-dim
discovery on 3D atm files.

build_html_report: per-file card now uses rec-level expected bounds
(was showing variable-group bounds, wrong for 21 cadence-split vars
incl. hfls, hfss, pr, prsn, rsds, rsus, uas, vas). NH/SH suffix on
hemispheric scalar files (siarea, siextent, sisnmass, sivol) with
North->South word substitution. FAIL diagnoses now append a
"mean-in-bounds, outlier-extremes" hint when the field mean sits
inside the expected window (e.g. hfls_mon FAIL is single-cell, not
field-wide).

sanity_check_ranges: recalibrate bounds with reviewer feedback —
phcint allow negatives (T<0degC at high lat), masscello to AWI-ESM
vertical discretization (5125..358750 kg/m^2), prra and siconc to
ice-zone/global walker semantics, cl to per-layer (~5%) not
column-integrated (~30%), umo/vmo/sfx/sfy reflect compute fix below.

custom_steps.compute_mass_transport, compute_salt_transport: integrate
across the FESOM cell edge by multiplying by sqrt(cell_area). Was
emitting kg/(s*m) but the CMIP7 spec for Omon.{umo,vmo,sfx,sfy}
requires kg/s. New _fesom_edge_width helper resolves the horizontal
dim against (nod2, ncells, ncol) and falls back to a clear error if
the mesh lacks cell_area. Order-of-magnitude check: typical FESOM HR
mid-lat cell now ~4.5e7 kg/s, strong-current cell ~1.8e9 kg/s.

cli37 walker output: 743 files, PASS 394 / WARN 251 / FAIL 96 /
ERROR 2.
@JanStreffing JanStreffing force-pushed the feat/cmip7-awiesm3-veg-hr branch from 09fe51e to 900c575 Compare May 18, 2026 14:44
JanStreffing and others added 22 commits May 18, 2026 16:52
… lat/lon

Implements the cmor/CMIP7 corrections from the cli37 sea-ice review.

hxy-si rules now apply mask_where_no_seaice so the output matches CMIP7
cell_methods "area: time: mean where sea_ice (mask=siconc)". Previously
prra/prsn (and all other rules using scale_pipeline / snm_pipeline /
sispeed_pipeline / sistressave_pipeline / sistressmax_pipeline /
siflcondtop_pipeline / sisnhc_pipeline / sitempbot_pipeline / sifb_pipeline /
simpeffconc_pipeline / constant_field_pipeline / sisnhc_from_msnow_pipeline /
fraction_to_percent_pipeline / sfdsi_from_fw_ice_pipeline) wrote the
field over the full FESOM domain including the tropics. New
scale_mask_pipeline / fraction_to_percent_mask_pipeline /
sfdsi_from_fw_ice_mask_pipeline cover the cases where the original
pipeline is shared with a non-hxy-si caller; the rest got
mask_where_no_seaice added in-place since every consumer is hxy-si.
The 17 direct-passthrough hxy-si rules (no pipelines: block) remain
unmasked — the audit directive was scoped to "every hxy-si rule
pipeline".

sisaltmass: scale_factor 0.004 -> 3.64. FESOM's m_ice is effective ice
height per cell area (units 'm', long_name "ice height per unit area"
in io_meandata.F90), not kg/m² as the prior rule comment assumed. The
correct salt-mass conversion is (sice/1000) * m_ice * rho_ice =
0.004 * m_ice * 910 = 3.64 * m_ice. sice=4.0 verified from
Final_CMIP7_IO_Test_06/config/fesom/namelist.ice; rho_ice=910 (AOMIP)
from fesom-2.7 MOD_ICE.F90:61.

sidmasstran[xy]: compute_ice_mass_transport now does
uice * m_ice * rho_ice * sqrt(cell_area) and emits kg/s. Reuses the
existing _fesom_edge_width helper (added by 900c575 for umo/vmo/sfx/sfy).
Previously the formula was uice * m_ice mislabelled "kg/s" — actually
m²/s, ~5 orders of magnitude too low.

regrid_oifs_to_fesom: assign_coords lat/lon on nod2 after the isel.
Source-grid lat/lon were attached to the now-removed source_dim; without
re-attaching the FESOM lat/lon, the output had only nod2 as a dim and
external tools (ushow, Panoply, ncview) couldn't render the field. The
sanity-check map renderer already worked around this with
_find_sibling_latlon; that workaround stays as a defence-in-depth fallback.

sisnthick: not added — CMIP7 v1.2.2.2 has no sisnthick out_name. The
CMIP7 equivalent in seaIce is snd (standard_name surface_snow_thickness,
units m), already produced by core_seaice/snd and cap7_seaice/snd_day.
…ealm

The walker recorded ``realm`` from the bounds-table lookup, which is
keyed by out_name only. For out_names that exist in multiple CMIP
realms (rlds in atmos+seaIce, prra, prsn, evspsbl, rsds, rsus, rlus,
ts, ...) every branded variant got stamped with whichever realm the
bounds table happened to store — typically the atm realm. The
``-hxy-si`` files then landed on ``atm.html`` even though their own
``:realm`` global attribute correctly said ``seaIce``.

worker_main now reads ``nc.realm`` and prefers it over the table value
for the emitted JSONL record. The file's own CMIP global attrs are
authoritative; the bounds-table realm is the fallback for ERROR or
NOBOUNDS records that never got a realm written.

collapse() in build_html_report.py now keys VarEntry by (var, realm)
instead of var alone, so a variable that appears under two realms
splits into two cards routed to the right HTML page. Previously
``ent.realm`` was "first record wins" — bundling all branded variants
under whichever record happened to come first in the JSONL.

Reported by Nadine on the cli37 report (rlds-si, prsn-si, ts-si etc.
showing up in atm.html with global data).
Each disabled rule emitted the same compound_name as an existing core
or cap7_atm rule with divergent content. Same metadata in both files
(long_name, standard_name, cell_methods, realm, units) so downstream
tools couldn't tell the LPJ-GUESS value from the OIFS one. CMIP7 ships
the OIFS coupled value as the canonical evspsbl/mrro/mrros/mrso/
mrsol/snc/snd/snw.

Re-enable under distinct CMIP7 names (e.g. evspsblveg for canopy,
tran for transpiration) if LPJ-specific land-hydrology values are
needed for vegetation analysis later.

core_land/areacella: kept enabled (duplicates core_atm but content is
byte-identical and each tier needs its own copy for cell_measures
resolution in stand-alone tier delivery — added comment noting this
is intentional).

Audit:
  grep -h '^\s*compound_name:' awi-esm3-veg-hr-variables/*/cmip7_*.yaml     | grep -v '^\s*#' | sort | uniq -c | awk '$1>1'
  -> 1 line (atmos.areacella, intentional)
…sin/sltbasin lat attrs

Christian's cli37 review flagged four ocean-side recipe issues that
fall on the pycmor side. All addressed here; the remaining items
(wfo / obvfsq / wo / volcello / opottempdiff bounds) are either
model-team questions or sanity-bound tuning, not recipe bugs.

hfds: FESOM `fh` is positive=UP (ice_oce_coupling.F90:418 writes
``heat_flux_in = heat_flux = -net_heat_flux``; density-flux equation
at line 654 confirms positive `heat_flux_in` corresponds to ocean
cooling/densification). CMIP7 `hfds` is positive=DOWN, so the rule
now applies `scale_factor: -1.0` via scale_pipeline. Previously hfds
was a bare passthrough, giving cli37 reviewers a North Atlantic that
appears to GAIN heat in January.

msftbarot: f_min default 1e-5 → 2.5e-5 in compute_msftbarot. The
geostrophic SSH approximation `psi = rho_0 * g * H / f * eta` breaks
down within ~10° of the equator, and at f=1e-5 only |lat| < ~4° gets
masked — leaving a band ±4-10° where f is small but finite, producing
the "weird artifact near the equator" Christian flagged. f_min=2.5e-5
extends the NaN mask to ±~10°.

compute_mass_transport: vertical component (transport_component='z',
used by wmo) now multiplies by `cell_area` instead of
`dz * sqrt(cell_area)`. The horizontal-edge formula
(dz * sqrt(cell_area), correct for umo/vmo) under-counted vertical
flux by ~dz/sqrt(cell_area) ≈ 50/1e4 ≈ 200× at FESOM HR resolution.

hfbasin / sltbasin: `lat` coordinate variable was attached as a bare
numpy array with no attributes; the written file then had `lat`
without `standard_name`/`units`/`axis`. Added CF attrs on both
compute_hfbasin_tripyview and compute_sltbasin_tripyview output paths.
…iew)

difmxylo: HR-FESOM (DARS ~10km) Laplacian eddy mixing values are
much lower than the coarse-resolution defaults the original bounds
assumed (Christian: "for higher resolution values should be lower as
far as I read up on this"). cli37 cmor observed mean ~0.01 m²/s,
max ~1.6 m²/s — bound expected_mean was ~1000, max ~1e4 (~5 orders
too high). New bounds 0 / ~0.01 / ~5 m²/s give PASS on cli37 stats
and stay narrow enough that a 100× regression would still WARN.

sfdsi: Christian: "pattern looks good - not sure why the bounds fail
here". cli37 cmor observed min -1.6e-4, max 6e-5 kg m-2 s-1; previous
bound min/max ±1e-5 was too narrow — Arctic freezing bursts in
fully-resolved sea-ice can hit ~1e-4 kg m-2 s-1 even though the
global mean stays near zero. Widened to ±5e-4 (~5× the observed peak).
Recipe commit e7306e2 added scale_factor: -1.0 to hfds (FESOM `fh` is
positive=up; CMIP positive=down). The bounds table still encoded the
pre-flip expectation:
  - expected_mean=~2 (typical positive piControl drift)
  - bounds ±300 W/m²

Post-flip, observed mean becomes negative (cli37 was +14.2 → post-fix
-14.2). The classify() sign-mismatch check at sanity_check.py:337-338
auto-FAILs when expected_mean > 0 and observed < 0, so the bounds
need updating regardless of magnitude.

Also widening to ±500: cli37 max=1910 W/m² (post-flip min=-1910)
exceeds ±300, and CMIP6 literature notes per-cell monthly peaks
reach ±500 W/m² in deep-convection / strong air-sea contrast regions.
Beyond ±500 stays a FAIL — sentinel for outlier cells that warrant
investigation (1910 W/m² is real but extreme).

expected_mean ~2 → ~0: piControl steady state has global mean
near 0; small drift in either direction is fine. The em==0.0 branch
of classify() (line 328) tolerates |mean|/envelope up to 10% as PASS.
Christian's cli37 review flagged hfbasin as FAIL on max=7.6 PW vs the
±2 PW bound. cli37 inspection shows the recipe is correct: annual
climatology matches Trenberth almost exactly (global peak +1.96 PW
@ 16°N; Atlantic +1.08 PW @ 16°N vs Trenberth ~1.3 PW). The 7.6 PW
is a monthly Indo-Pacific extreme — basin-sum peak across (lat, time).

Literature for monthly hfbasin extremes:
  - Trenberth & Caron 2001: global annual peak ~2 PW @ 35°N
  - RAPID 26.5°N Atlantic monthly range 0.2-2.5 PW (Johns 2011)
  - Pacific tropical cell NHT 1.75±0.30, SHT -1.69±0.55 PW
  - North Pacific 24°N seasonal cycle 0 → 1.1 PW (Bryden et al.)
  - HR mesoscale eddies add 30-50% over coarse models

Combined: global = basin-sum at peak latitude reaches 5-8 PW
monthly extreme in HR models. cli37's 7.6 PW max sits within this
range. Widen bounds to ±10 PW so monthly extremes PASS while a
2× regression still WARNs and 5× regression FAILs.
- nep: -5e-6 / 2e-7 (was -5e-8 / 5e-8) — Central America wet-tropics
  drainage spikes are real model output (cli37 min -1.93e-6, max 1.01e-7)
- mrrob: max 5e-3 (was 1e-4) — singular grid-cell spikes are real,
  global field fine (cli37 max 2.34e-3)
- fHarvestToAtmos: ~1e-10 mean / ~1e-8 max (was ~0/~0) — 1850 LUH3
  state has cropland+pasture that keeps being harvested every year
- tsl: floor 150 K (was 220 K) — LPJ-GUESS shallow-layer tracks
  OpenIFS forcing when uninsulated; 1.5% of values <220 K in NH winter
- vegHeight: rationale text fixed — LPJ-GUESS doesn't emit grass-only
  height, cmor falls back to tree-dominated field

Reclassify on test06 JSONL: 5 FAILs cleared (2 nep, 2 fHarvestToAtmos,
1 mrrob FAIL->WARN).

Three "should be zero" vars (fAnthDisturb, fNAnthDisturb, fNfert) NOT
relaxed despite cli37 showing 1e-10..1e-8 values — Laszlo says these
should be truly ~0 under frozen 1850 LU; spawned as D6 investigation.
…source

Three new pipeline steps in examples/custom_steps.py:

  clip_small_negatives  — sets [-thr, +thr] to 0 (default thr=1e-10).
                          Clears tiny float-noise that propagates from
                          raw .out files. Reviewer (Laszlo) confirmed
                          these are not a bug to chase upstream.

  clip_floor_zero       — floors negatives at 0. For variables whose
                          physical floor is 0 but where the pycmor
                          pipeline introduces negatives not present in
                          the raw model output (mrsol, rhSoil per
                          Laszlo).

  broadcast_yearly_to_monthly
                        — repeats each yearly value across 12 mid-month
                          time stamps. Used where the native monthly
                          file is LAI/phenology-weighted and the yearly
                          file is the authoritative stand-area source.

Pipeline wiring:

  Phase B (clip_small_negatives inserted after loader in 4 pipelines):
    cap7_land: lpjg_monthly_pipeline
    veg_land:  lpjg_monthly_pipeline, lpjg_monthly_lut_pipeline,
               mrtws_pipeline

  Phase C (new dedicated pipelines, 2 rules rewired):
    cap7_land: lpjg_monthly_clip0_pipeline       <- rhSoil_mon
               lpjg_monthly_depth_clip0_pipeline <- mrsol_mon
    Both add clip_floor_zero on top of clip_small_negatives.

  D4a (new pipeline, treeFrac total source switch):
    cap7_land: lpjg_yearly_to_monthly_pipeline   <- treeFrac_mon
               treeFrac_mon now sources treeFrac_yearly.out (was
               treeFrac_monthly.out). The native monthly file is
               LAI-weighted; the yearly file is correct stand area.

variable_attributes.comment added on treeFrac_mon, rhSoil_mon, mrsol_mon
documenting the derivation / clipping for downstream traceability.

The per-PFT tree fractions (treeFracBdlDcd, treeFracBdlEvg,
treeFracNdlDcd, treeFracNdlEvg) remain on the broken monthly source
pending D4c — Laszlo signoff on the synthesis math AND/OR a possible
LPJ-GUESS upstream fix he pushed today emitting yearly per-PFT files.
See HANDOFF_d4_treeFrac_per_pft.md.
PLAN_veg_land_sanity_fixes.md — triages Laszlo + Christian feedback
into 5 phases (A bounds, B clip-band, C sign-fix, D investigations,
E verify) with explicit execution order. Round-1 review folded in.

HANDOFF_d4_treeFrac_per_pft.md — investigation of treeFracNdlDcd
annual-cycle bug. Documents:
  - root cause: LPJ-GUESS monthly per-PFT output is LAI/phenology
    weighted, not stand area
  - file inventory: treeFrac_yearly.out exists (total), per-PFT
    yearly files do NOT
  - four-option matrix (A skip / B synthesize / C max-only /
    D wait-for-upstream)
  - Option B math + per-cell normalization + divide-by-zero handling
  - D4b ratio sanity check on cli37: median 1.010, 88% within ±20%
    — math premise sound
  - escalation channels + comment-attribute draft

REVIEW_veg_land_sanity_fixes_round1.md — external review of the plan;
folded into PLAN.

REVIEW_d4_treeFrac_per_pft.md — external review of the handoff;
folded into HANDOFF.

Three "should be zero" variables spawned as D6 from Phase A data
anchor: fAnthDisturb, fNAnthDisturb, fNfert have 1e-10..1e-8 values
in cli37 despite frozen 1850 LU — pending Laszlo round-3 investigation.
cli37 wo file inspection: olevel = [0, 5, 10, 20, 30, ..., 6250] —
i.e. mesh.depth_bnds values (interface depths), not mesh.depth
(midpoints), even though olevel:name = "nz1". FESOM 2.7's
io_meandata.F90:1520 def_streams w on the top N layer interfaces
(N=57 for DARS), with the surface at z=0 and the seabed BC implicit.

The reviewer's "uppermost layer not too bad, those below are noisy"
is the surface BC literally preserved (w=0 at interface 0) while every
deeper interface carries diagnostic-w noise from integrating
horizontal divergence down from the surface. The recipe was a bare
passthrough that emitted these interfaces verbatim with a misleading
midpoint-style coord name.

New compute step ``average_w_interfaces_to_midpoints``:
  midpoint[i]   = 0.5 * (w[i] + w[i+1])    for i = 0 … N-2
  midpoint[N-1] = 0.5 * w[N-1]              (bottom BC w_seabed=0)
The vertical coord is replaced with mesh.depth (CMIP midpoint axis)
and the dim renamed to nz1. ``compute_mass_transport`` already does
the same averaging internally for ``wmo``; this factors it out for
use in the new wo_pipeline.

Effects:
  - Surface BC folded into first midpoint → no jump between layer 0
    and 1; the "clean top" outlier disappears.
  - Output on 57 cell-centre midpoints, matching CMIP wo convention.
  - Vertical coord values switch from interface depths to midpoint
    depths.
…fNAnthDisturb, fNfert

LUH3 1850 wood-harvest transitions (primf/secmf/secnf_harv) and a small
implicit LPJ-GUESS management N baseline make these fluxes non-zero in
piControl
The OIFS branch feature/cmip7-rh-online (merged into local_combined_fixes
2026-04-17) computes CMIP/CF-conformant relative humidity online every
timestep using Alduchov-Eskridge Magnus with the hard water/ice switch
at 273.15 K, and sends `hur_cmip7` (model levels) and `hurs_cmip7` (2m)
to XIOS. Until now those sends were dropped because no XIOS field
declared the IDs.

field_def_cmip7.xml.j2:
- declare raw `hurs_cmip7` (fraction) in 2D_physical
- declare raw `hur_cmip7` (fraction) in 3D_ml
- add percent alias `near_surface_relative_humidity_pct__hurs` (name="hurs")
- repoint `relative_humidity_ml__hur` from `r` (FOEEWM mixed-phase QSAT,
  NOT CMIP-conformant) to `hur_cmip7`

file_def_oifs_cmip7_spinup.xml.j2:
- write `hurs` into atmos_mon (monthly surface)
- write `hurs` into atmos_day (daily surface)
- atmos_mon_ml already references relative_humidity_ml__hur, so it
  picks up CMIP7-conformant hur automatically

Addresses Felix Pithan's 2026-04-15 feedback: IFS-native r/r_pl uses
mixed-phase QSAT interpolation between RTICE and RTWAT (not CMIP-CF),
and post-hoc RH from monthly-mean ta/hus is biased due to nonlinear
e_sat(T). Computing online and averaging downstream fixes both.

Pycmor-side: hurs_pipeline (post-hoc Magnus from 2t+2d) is left in
place for now; can be retired once new XIOS hurs output is validated.
…ve ice pipelines

Commit 5667ab2 added mask_where_no_seaice broadly to every hxy-si rule.
That was the right fix for prra (Christian's explicit complaint: "rule
cell_methods: where sea_ice is not applied - one sees a global field
instead of a polar field") and for the ~9 atm-regridded rules
(rlds/rlus/rsds/rsus_seaice, sbl_seaice, sifllattop, siflsenstop) that
carry atmospheric fluxes over the entire surface and need masking to
the ice domain for CMIP cell_methods compliance.

For the remaining 13 FESOM-native ice pipelines, the underlying FESOM
field is intrinsically zero (or NaN) outside the ice zone — the model
only updates these where ice exists. Masking those is at best
cosmetic (0 -> NaN) and at worst causes downstream dask graph stalls:
in cli38 the lrcs_seaice tier OOM'd at MaxRSS 525 GiB on a 512 GiB
cgroup; halving N_WORKERS produced an identical 525 GiB peak (memory
is per-rule, not per-worker), and halving SHARD_SIZE to spread the
load eliminated the OOMs but exposed a save_dataset wedge in
sivol/sihc/snm/siflfwbot/etc. with no I/O progress for 45+ minutes —
the dask graph for `data.where(mask)` was not being released before
save_dataset, deadlocking the worker pool.

Removed mask step from these pipelines:
  scale_mask_pipeline, sfdsi_from_fw_ice_mask_pipeline,
  fraction_to_percent_mask_pipeline, snm_pipeline, sisnhc_pipeline,
  sispeed_pipeline, sistressave_pipeline, sistressmax_pipeline,
  siflcondtop_pipeline, sitempbot_pipeline, sifb_pipeline,
  simpeffconc_pipeline, constant_field_pipeline.

Kept mask in these pipelines (genuinely needed for atm-regridded fields):
  regrid_atm_to_fesom_seaice_mask_pipeline
  regrid_atm_to_fesom_seaice_mask_negate_pipeline

Rules affected by the removal still inherit their previous correctness
fixes from 5667ab2 (scale_factor on sisaltmass, cell-width factor on
sidmasstran[xy], lat/lon attachment) — only the unnecessary mask step
is gone.
Upstream release Dec 19 2025 (v1.2.2.2 was Sep 30 2025). Same 1974
variables in both, but the compound_name schema changed in three ways:

- region suffixes lowercased: GLB/NH/SH -> glb/nh/sh
  (552 .GLB + 8 .NH + 8 .SH renames across 17 rule yamls)
- ATA/GRL polar codes lowercased: ata/grl (no project rules affected)
- 13 real compound_name schema renames touching this project:
  - atmosChem.{cfc11,cfc12,ch4,n2o}.tavg-u-hm-u -> .tavg-u-hm-air
  - land.tas.tavg-u-hxy-u.1hr -> .tavg-h2m-hxy-u.1hr (glb + 30S-90S)
  - land.tslsi.tavg-u-hxy-u.day -> .tavg-u-hxy-lsi.day
  - seaIce.sisnmass.tavg-u-hm-u -> .tavg-u-hm-si (day/mon x nh/sh)
  - atmos.ts.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day
  - land.esn.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day

Source default in std_lib/global_attributes.py:136 updated from "GLB"
to "glb" to match the new lowercase convention when get_region() falls
back to the default (rule paths already supply region explicitly via
compound_name).

Validation: all 17 rule yamls construct CMORizer cleanly against the
new metadata.json; 548/552 compound_names resolve in v1.2.2.3 (the 4
unmatched are Laszlo's per-PFT yearly diagnostic rules, extra-CMIP).

Metadata cache populated at:
  ~/.cache/pycmor/cmip7_metadata/v1.2.2.3/metadata.json (kept v1.2.2.2)

Substantive content edits in v1.2.2.3 (for variables we use):
- atmos.{hurs,rsds,sfcWind}.tavg-...-1hr.30S-90S: cleaner CF-standard
  long_name values (long_name was internal text in v1.2.2.2).
cli39 veg_land shards 2/3/4 failed with
  ValueError: Rule 'treeFracBdlEvg_yr' with
  compound_name='land.treeFracBdlEvg.tavg-u-hxy-u.yr.GLB' did not match any
  variables in the CMIP7 data request

CMIP7 only has the per-PFT treeFrac variants at MONTHLY cadence:
  Emon.treeFracBdlDcd / Emon.treeFracBdlEvg / Emon.treeFracNdlDcd / Emon.treeFracNdlEvg

No yearly Eyr.treeFrac{BdlDcd,BdlEvg,NdlDcd,NdlEvg} entries exist. The monthly
counterparts are already wired in cap7_land/cmip7_awiesm3-veg-hr_cap7_land.yaml
and cover the CMIP7 requirement.

Removes:
- treeFracBdlDcd_yr
- treeFracBdlEvg_yr
- treeFracNdlDcd_yr
- treeFracNdlEvg_yr

Keeps the aggregate treeFrac_yr (maps to valid Eyr.treeFrac).
cli39 extra_land (job 25003750_1) hit 3h walltime with 4 save_dataset
ops stuck at heartbeat #16:
  ⟳ save_dataset[tas]    still running (t=960s, #16)
  ⟳ save_dataset[mrsow]  still running (t=960s, #16)
  ⟳ save_dataset[dslw]   still running (t=960s, #16)
  ⟳ save_dataset[orog]   still running (t=960s, #16)

All four are long-running surface saves (1-hourly tas SH-subset, daily
mrsow/dslw via swvl1..swvl4 + temporal_diff, fx orog SH-subset). With
extra_land's 18 rules all in a single shard at SHARD_SIZE=20, these
four kicked off parallel saves that contended on the synchronous
netcdf write scheduler. 9 saves completed cleanly, then the 4 wedged.

Lower SHARD_SIZE specifically to 5 for extra_land — splits 18 rules
into ~4 shards, distributing the contended saves across them. Each
shard's individual save peak stays small enough to finish inside 3h.
Other tiers keep the global SHARD_SIZE=20.
Both tiers hit TIMEOUT in cli40 due to multiple save_dataset operations
contending on the global HDF5 write lock with no progress at heartbeat
#87+ over 3h. Sequence in lrcs_seaice_3:
  siarea 176×, siextent 89×, sisnmass 87×, sfdsi 87×, sihc 86×, sidmasstrany 86×
veg_land_1 wedged on vegHeight/dgw/tslsi/esn at heartbeat #21.

Wall-vs-Σsave analysis (cli40 successful shards) shows these two tiers
already run nearly serially under lock contention:
  lrcs_seaice_4: ratio 1.18× (12% wall-time penalty if forced serial)
  lrcs_seaice_2: 1.8×
Forcing strict serial eliminates the wedge with negligible cost.

3D-heavy tiers (cap7_atm 4.1×, core_ocean 3.7×, extra_atm 1.8×) genuinely
need parallel saves to stay inside 3h walltime, so they are NOT
modified.

Mechanism: every pipeline in the two yamls now declares
  throttle_group: <tier>_serial
and the pycmor: section adds
  throttle_caps:
    <tier>_serial: 1
which caps in-flight rules of that group at 1 per batch.

For lrcs_seaice this also subsumes the previous oifs_regrid cap=2 (which
was set for driver-RSS, not the lock); cap=1 trivially satisfies it.
cli41 logs showed "Throttle caps (per-group rule submission limit):
none" in both lrcs_seaice and veg_land — the yaml-level `throttle_caps`
block I added in 5a61ad3 got silently dropped by the Everett-based
PycmorConfigManager (only Options declared in PycmorConfig are
preserved through schema normalisation; unknown keys vanish).

Result: default per-group cap of 2 applied, lrcs_seaice still wedged on
9+ parallel save_dataset operations:
  cli41 _1: TIMEOUT (heartbeat #16, 9 saves stuck)
  cli41 _3: TIMEOUT (same pattern)

The cmorizer `_resolve_throttle_caps()` already reads PYCMOR_THROTTLE_CAPS
from the environment as a higher-priority source than the yaml. Move
the per-tier cap declaration to the launcher.

submit_hr_year_shards.sh:
- New per-tier `tier_throttle_caps` case:
    lrcs_seaice  → "lrcs_seaice_serial:1"
    veg_land     → "veg_land_serial:1"
    (others)     → ""
- Threaded into `--export=...,PYCMOR_THROTTLE_CAPS=$tier_throttle_caps`

Yaml cleanup:
- Removed the dead `throttle_caps:` block from both yamls
- Kept the explanatory comment so the next reader knows where the cap
  actually lives (the launcher)

Pipeline-level `throttle_group:` annotations from 5a61ad3 are unchanged
and are still read by Everett (declared on the Pipeline schema).
cli43 lrcs_seaice_3 TIMEOUT despite cap=1 being correctly applied:
  Throttle caps: {'lrcs_seaice_serial': 1}
  ... but 5 saves concurrent and stuck at heartbeat #101+

Root cause: `_rule_throttle_group()` only walked `rule.pipelines[*]`.
20 of 64 lrcs_seaice rules (siarea_north, siextent_north_day,
siflcondbot, sidconcth, sitimefrac, ...) have no `pipelines:` key —
they use pycmor's default pipeline, which carries no throttle_group,
so they landed in `_unthrottled` and ran in parallel through the
HDF5 global write lock.

Two-part fix:
1. cmorizer.py: read rule.throttle_group first, fall back to pipeline
   annotation. Rule-level wins (per-rule override), preserves existing
   pipeline-level behaviour.
2. lrcs_seaice + veg_land yamls: add `throttle_group: <tier>_serial`
   to the `inherit:` block. With validate.py:299 already accepting
   `throttle_group` on the rule schema, inherit propagates it to every
   rule — pipelined or not.

Effect on cli44+: all 64 lrcs_seaice rules (and all veg_land rules)
join their tier's serial group; PYCMOR_THROTTLE_CAPS env var caps it
at 1; the multi-rule HDF5-lock wedge can no longer form.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants