Feat/cmip7 awiesm3 veg hr by JanStreffing · Pull Request #266 · esm-tools/pycmor

JanStreffing · 2026-04-02T12:55:34Z

CMIP7 cmorization for AWI-ESM3-VEG-HR

Adds full CMIP7 support targeting AWI-ESM3-VEG-HR, including a native compound-name
architecture that replaces the legacy cmip6-table-based data request lookup.

Key changes

CMIP7 data request

Load DataRequest from CMIP7_DReq_metadata JSON instead of cmip6 tables
Refactor to native compound-name architecture (ocean.tos.tavg-u-hxy-sea.mon.GLB)
Fix JSON key mismatch: cmip6_table → cmip6_cmor_table in vendored metadata
Improve compound_name matching against cmip6_compound_name and cmip7_compound_name attributes
Derive table_id from compound name when not set explicitly
Strict ValueError on zero DRV matches (instead of silent skip)

Pipeline

Add generic vertical_integrate custom pipeline step
Remove duplicate convert() step from DefaultPipeline
Fix Prefect State objects not being unwrapped to actual results in parallel runs
Propagate pipeline/flow errors instead of silently logging them

Standard library

Add time bounds support (src/pycmor/std_lib/time_bounds.py)
Fix dimension mapping to use getattr + _pycmor_cfg fallback
Fix global_attributes to derive table_id from CMIP6/CMIP7 compound names

Xarray accessor API

Lazy accessor registration and StdLibAccessor with .process()

Test infrastructure

Modernize with entry-point model discovery (pycmor.fixtures.model_runs)
Add pycmor.tutorial dataset system (xarray.tutorial-style API)
Fix stub generator to use monotonic coordinate values for multi-file datasets

Misc fixes

Python 3.9 entry_points() compatibility
Guard pyfesom2 imports for environments without it
Fix tarball double-nesting extraction on Python 3.12+
Rename non-standard time dimension on load (OpenIFS support)

Test plan

JanStreffing · 2026-04-02T21:06:52Z

I was able to run both tos and the more complex absscint with the lastest commit on this branch. We may want to work on #267, and certainly need to work on #265. But neither should block us from starting to build up more rules for the picontrol variables.

JanStreffing · 2026-04-05T10:57:09Z

The single failing test (test_library_process[fesom_dev-native-dask-cmip6]) is unrelated to this PR. It failed due to a network issue — the test tried to download pi_uxarray.tar from nextcloud.awi.de but got ConnectionError: [Errno 111] Connection refused. The AWI Nextcloud server was unreachable from the GitHub Actions runner.

Something to worry about? @pgierz

mandresm · 2026-04-07T12:17:08Z

@pgierz, will you review this or should I go for it?

pgierz · 2026-04-07T16:45:01Z

I will look

calculate_chunks_simple() could produce chunk sizes larger than the actual dimension when the dataset was very small (e.g. fx variables with only 47 or 57 elements). This caused netCDF4 createVariable to raise "chunksize cannot exceed dimension size". Add min() clamp.

When xarray reads NetCDF files, it stores 'coordinates' in the variable encoding dict. set_coordinate_attributes then sets it in attrs too, causing xarray to raise "'coordinates' found in both attrs and encoding" on save. Pop from encoding before setting in attrs.

CMIP7 salinity units are "1E-03" (dimensionless with scaling factor). pint-xarray cannot convert to such targets, raising "Unit expression cannot have a scaling factor". When both source (e.g. psu) and target are dimensionless, relabel without numeric conversion since the values are already in the correct range.

All ~33 custom step functions hardcoded the CMOR variable name (e.g. result.name = "zostoga") instead of using rule.model_variable. This caused KeyError in set_variable when model_variable differs from the CMOR name (e.g. model_variable="temp" but name was "zostoga"). Also fix compute_mass_transport to handle W-level data (nz=48 interfaces) by averaging to cell centers (nz=47) before multiplying by layer thickness.

cli3X veg_land OSError root cause: every load_lpjguess_* function did ``for _, row in df_all.iterrows():`` to scatter rows into a pre-allocated (time, ncells) numpy array. At HR LPJ-GUESS resolution that's millions of Python iterations holding the GIL for several minutes per rule. A dask worker thread that holds the GIL can't service its own LocalCluster scheduler's heartbeat. After distributed.comm.timeouts.connect (30s default) the scheduler marks the worker dead and disconnects; the worker comes back when iterrows finishes but the cluster has already cascaded into: OSError: Timed out trying to connect to tcp://127.0.0.1:<port> after 30 s — visible in every cli3X veg_land shard log, killing all 3 shards. The whole-rule retry decorator (8815440) catches the OSError but the LocalCluster is too damaged to recover within the retry window. Vectorized pattern (same logic, no Python row loop): coords_df_with_idx = coords_df.copy() coords_df_with_idx["_cell_idx"] = np.arange(len(coords_df_with_idx)) df_merged = df_all.merge( coords_df_with_idx[["Lon", "Lat", "_cell_idx"]], on=["Lon", "Lat"], how="left", ) valid = ~np.isnan(df_merged["_cell_idx"].values) cell_idx_int = df_merged["_cell_idx"].values[valid].astype(np.int64) yr_idx = np.searchsorted(years, df_merged["Year"].values[valid]) values[yr_idx, cell_idx_int] = df_merged[var].values[valid] Pandas merge runs in C and releases the GIL during the join. Load should drop from "minutes blocking GIL" to "seconds". Applied to all six loaders that used the iterrows pattern: load_lpjguess_monthly (yearly + monthly format branches) load_lpjguess_yearly load_lpjguess_yearly_lut load_lpjguess_monthly_lut load_lpjguess_monthly_depth (3D: time × sdepth × ncells) load_lpjguess_monthly_pool (3D: time × soilCpool × ncells) The 3D loaders still iterate the small depth/pool dim (3-6 layers) in Python — that's bounded and fast. The hot loop (Python over all rows) is gone in all six.

extra_atm sacct history (May 1-15): every reliable completion used 512G memory. Whole-tier era (pre-shard): 24733123 mem512 01:51:28 24733545 01:11:17 24743470 cli3 02:07:45 24748551 cli5 02:34:58 24784803 cli9 01:22:15 24813316 cli17 01:10:25 Sharded --mem=0 era (default ~256G): cli28-cli36: OOM × 6, TIMEOUT × 1, two completions at non-default 241016M (~235G). 6 completions on 512G + 0 reliable completions on 256G — proven config. cli36 forensics confirmed the failure mode is not fix3-wedge or allocator (jemalloc on, fix3=off, 17/20 rules saved): MaxRSS=167G but MaxVMSize=458G shows malloc fragmentation pushing past the cgroup, plus heavy concurrent hourly OIFS + 3D rules ('cl', 'pfull' both at 55min when 3 new rules started) tipping it over. Adds 1 task to the scarcer ~282-node 512G pool; total 512G footprint becomes 5/35 tasks (14% of work). Pool pressure remains minor. Alternative considered: SHARD_SIZE=10 to halve concurrent-heavy count. Architecturally cleaner but unproven for this tier — chose the path with 6 historical successes.

…/_infer_frequency_multifile Three Python-loop hotspots replaced with pandas vectorized paths: 1. ``select_range`` and ``validate_range``: ``df["start"].apply(pd.Timestamp)`` → ``pd.to_datetime(df["start"])``. Same conversion, but C-level under the hood. These run on every Filecache load that filters by time range. 2. ``_infer_frequency_multifile``: ``for _, row in df.iterrows()`` building the timestamp list → ``pd.to_datetime(..., errors="coerce")`` + boolean masks. The try/except per-row pattern becomes a single ``valid`` mask; the steps-1/2/>2 branching reduces to a ``steps == 2`` mask appending end timestamps in addition to starts. Net: no behaviour change, no memory change (same Series being materialized, just via vectorized parse path). All 17 filecache unit tests pass. Same diagnostic motivation as the lpjg vectorization (7900daa): Python iteration over pandas/numpy in dask-managed code holds the GIL longer than necessary and starves the LocalCluster heartbeat. filecache loads happen at flow startup, so the impact here is on cold-start time, not mid-rule deadlocks — but the antipattern is the same. Not touched: ``regrid_regular_to_fesom`` in examples/custom_steps.py (also has a per-timestep loop) — not referenced by any production cli3X rule yaml, only an old test config. Dead code for the production path.

lrcs_seaice was given 6h walltime in 0c4aff1 because cli30 lrcs_seaice_3 TIMEOUTed at 3h. That was before jemalloc (cli34/5/6/7 made the allocator opt-in then global). The previous slow path was driven by glibc fragmentation, not real compute time. Empirical wall times since jemalloc landed: cli35 lrcs_seaice _1/2/3/4: 3:17, 1:08:52, 1:11:16, 46:13 cli36 lrcs_seaice _1/2/3/4: 3:07, 1:12:04, 1:10:59, 56:15 slowest single shard: 1:12:04 — well under 3h 3h gives 1h48m headroom on the worst observed run, which is enough. All tiers now use the global WALLTIME default (3h) — uniform, simpler.

New env knob ``SHARD_DRS`` (default ``off``) toggles pycmor's ``enable_output_subdirs`` config option. When on: * The injected per-shard yaml sets ``pycmor.enable_output_subdirs = True``, so files.py's filename builder appends the DRS sub-tree returned by GlobalAttributes .subdir_path(): <mip_era>/<activity_id>/<institution_id>/<source_id>/ <experiment_id>/<member_id>/<table_id>/<variable_id>/ <grid_label>/v<YYYYMMDD>/ e.g. ``CMIP7/CMIP/AWI/AWI-ESM-3/picontrol/r1i1p1f1/Amon/tas/gn/ v20260515/tas_<...>.nc`` * The submitter collapses the per-tier ``<tier>/cmorized/`` OUTSUB to ``.`` so all 17 tiers land in one shared DRS root. The non-DRS nesting was a pycmor-specific organizational layer that downstream consumers don't expect at the start of the path. Off by default — keeps the legacy per-tier flat layout for any current consumers. Turn on via ``SHARD_DRS=on bash submit_hr_year_shards.sh ...`` for production / publication-ready output.

…zed layout build_html_report: render_file_card was showing variable-group bounds (whichever cadence happened to merge first) for every per-file card. 21 variables with multiple cadence rows displayed wrong "Expected" columns and range SVG. Prefer rec.expected_* over ent.expected_*. sanity_check: walker glob() missed files under <root>/<tier>/cmorized/ (produced when the runner emits a DRS-shaped subtree). Switch to rglob() so the walker handles both flat and nested layouts. run_walker_compute.sh: new sbatch wrapper to walk on compute nodes; login-node OpenBLAS pthread limit + Lustre contention crashes the worker pool on the big 8-19 GB 1hr/3D-model-level files. run_perfile_maps_compute.sh: paths bumped to cli37 run.

build_maps: collapse all non-spatial dims (time + level + tile) by the matching reducer per panel (min/min, mean/mean, max/max) so 3D fields get full spatial coverage instead of mostly-NaN deep-level slices. Other fixes: rglob() for nested cmorized/ layout, sibling-file lat/lon recovery for hxy-si nod2 files that strip coords, dask-chunked open so 84-GB cl_day/pfull_day render without OOM, 2-98 percentile colorbar clipping, prefer lat/lon over latitude/longitude for spatial-dim discovery on 3D atm files. build_html_report: per-file card now uses rec-level expected bounds (was showing variable-group bounds, wrong for 21 cadence-split vars incl. hfls, hfss, pr, prsn, rsds, rsus, uas, vas). NH/SH suffix on hemispheric scalar files (siarea, siextent, sisnmass, sivol) with North->South word substitution. FAIL diagnoses now append a "mean-in-bounds, outlier-extremes" hint when the field mean sits inside the expected window (e.g. hfls_mon FAIL is single-cell, not field-wide). sanity_check_ranges: recalibrate bounds with reviewer feedback — phcint allow negatives (T<0degC at high lat), masscello to AWI-ESM vertical discretization (5125..358750 kg/m^2), prra and siconc to ice-zone/global walker semantics, cl to per-layer (~5%) not column-integrated (~30%), umo/vmo/sfx/sfy reflect compute fix below. custom_steps.compute_mass_transport, compute_salt_transport: integrate across the FESOM cell edge by multiplying by sqrt(cell_area). Was emitting kg/(s*m) but the CMIP7 spec for Omon.{umo,vmo,sfx,sfy} requires kg/s. New _fesom_edge_width helper resolves the horizontal dim against (nod2, ncells, ncol) and falls back to a clear error if the mesh lacks cell_area. Order-of-magnitude check: typical FESOM HR mid-lat cell now ~4.5e7 kg/s, strong-current cell ~1.8e9 kg/s. cli37 walker output: 743 files, PASS 394 / WARN 251 / FAIL 96 / ERROR 2.

… lat/lon Implements the cmor/CMIP7 corrections from the cli37 sea-ice review. hxy-si rules now apply mask_where_no_seaice so the output matches CMIP7 cell_methods "area: time: mean where sea_ice (mask=siconc)". Previously prra/prsn (and all other rules using scale_pipeline / snm_pipeline / sispeed_pipeline / sistressave_pipeline / sistressmax_pipeline / siflcondtop_pipeline / sisnhc_pipeline / sitempbot_pipeline / sifb_pipeline / simpeffconc_pipeline / constant_field_pipeline / sisnhc_from_msnow_pipeline / fraction_to_percent_pipeline / sfdsi_from_fw_ice_pipeline) wrote the field over the full FESOM domain including the tropics. New scale_mask_pipeline / fraction_to_percent_mask_pipeline / sfdsi_from_fw_ice_mask_pipeline cover the cases where the original pipeline is shared with a non-hxy-si caller; the rest got mask_where_no_seaice added in-place since every consumer is hxy-si. The 17 direct-passthrough hxy-si rules (no pipelines: block) remain unmasked — the audit directive was scoped to "every hxy-si rule pipeline". sisaltmass: scale_factor 0.004 -> 3.64. FESOM's m_ice is effective ice height per cell area (units 'm', long_name "ice height per unit area" in io_meandata.F90), not kg/m² as the prior rule comment assumed. The correct salt-mass conversion is (sice/1000) * m_ice * rho_ice = 0.004 * m_ice * 910 = 3.64 * m_ice. sice=4.0 verified from Final_CMIP7_IO_Test_06/config/fesom/namelist.ice; rho_ice=910 (AOMIP) from fesom-2.7 MOD_ICE.F90:61. sidmasstran[xy]: compute_ice_mass_transport now does uice * m_ice * rho_ice * sqrt(cell_area) and emits kg/s. Reuses the existing _fesom_edge_width helper (added by 900c575 for umo/vmo/sfx/sfy). Previously the formula was uice * m_ice mislabelled "kg/s" — actually m²/s, ~5 orders of magnitude too low. regrid_oifs_to_fesom: assign_coords lat/lon on nod2 after the isel. Source-grid lat/lon were attached to the now-removed source_dim; without re-attaching the FESOM lat/lon, the output had only nod2 as a dim and external tools (ushow, Panoply, ncview) couldn't render the field. The sanity-check map renderer already worked around this with _find_sibling_latlon; that workaround stays as a defence-in-depth fallback. sisnthick: not added — CMIP7 v1.2.2.2 has no sisnthick out_name. The CMIP7 equivalent in seaIce is snd (standard_name surface_snow_thickness, units m), already produced by core_seaice/snd and cap7_seaice/snd_day.

…ealm The walker recorded ``realm`` from the bounds-table lookup, which is keyed by out_name only. For out_names that exist in multiple CMIP realms (rlds in atmos+seaIce, prra, prsn, evspsbl, rsds, rsus, rlus, ts, ...) every branded variant got stamped with whichever realm the bounds table happened to store — typically the atm realm. The ``-hxy-si`` files then landed on ``atm.html`` even though their own ``:realm`` global attribute correctly said ``seaIce``. worker_main now reads ``nc.realm`` and prefers it over the table value for the emitted JSONL record. The file's own CMIP global attrs are authoritative; the bounds-table realm is the fallback for ERROR or NOBOUNDS records that never got a realm written. collapse() in build_html_report.py now keys VarEntry by (var, realm) instead of var alone, so a variable that appears under two realms splits into two cards routed to the right HTML page. Previously ``ent.realm`` was "first record wins" — bundling all branded variants under whichever record happened to come first in the JSONL. Reported by Nadine on the cli37 report (rlds-si, prsn-si, ts-si etc. showing up in atm.html with global data).

Each disabled rule emitted the same compound_name as an existing core or cap7_atm rule with divergent content. Same metadata in both files (long_name, standard_name, cell_methods, realm, units) so downstream tools couldn't tell the LPJ-GUESS value from the OIFS one. CMIP7 ships the OIFS coupled value as the canonical evspsbl/mrro/mrros/mrso/ mrsol/snc/snd/snw. Re-enable under distinct CMIP7 names (e.g. evspsblveg for canopy, tran for transpiration) if LPJ-specific land-hydrology values are needed for vegetation analysis later. core_land/areacella: kept enabled (duplicates core_atm but content is byte-identical and each tier needs its own copy for cell_measures resolution in stand-alone tier delivery — added comment noting this is intentional). Audit: grep -h '^\s*compound_name:' awi-esm3-veg-hr-variables/*/cmip7_*.yaml | grep -v '^\s*#' | sort | uniq -c | awk '$1>1' -> 1 line (atmos.areacella, intentional)

…sin/sltbasin lat attrs Christian's cli37 review flagged four ocean-side recipe issues that fall on the pycmor side. All addressed here; the remaining items (wfo / obvfsq / wo / volcello / opottempdiff bounds) are either model-team questions or sanity-bound tuning, not recipe bugs. hfds: FESOM `fh` is positive=UP (ice_oce_coupling.F90:418 writes ``heat_flux_in = heat_flux = -net_heat_flux``; density-flux equation at line 654 confirms positive `heat_flux_in` corresponds to ocean cooling/densification). CMIP7 `hfds` is positive=DOWN, so the rule now applies `scale_factor: -1.0` via scale_pipeline. Previously hfds was a bare passthrough, giving cli37 reviewers a North Atlantic that appears to GAIN heat in January. msftbarot: f_min default 1e-5 → 2.5e-5 in compute_msftbarot. The geostrophic SSH approximation `psi = rho_0 * g * H / f * eta` breaks down within ~10° of the equator, and at f=1e-5 only |lat| < ~4° gets masked — leaving a band ±4-10° where f is small but finite, producing the "weird artifact near the equator" Christian flagged. f_min=2.5e-5 extends the NaN mask to ±~10°. compute_mass_transport: vertical component (transport_component='z', used by wmo) now multiplies by `cell_area` instead of `dz * sqrt(cell_area)`. The horizontal-edge formula (dz * sqrt(cell_area), correct for umo/vmo) under-counted vertical flux by ~dz/sqrt(cell_area) ≈ 50/1e4 ≈ 200× at FESOM HR resolution. hfbasin / sltbasin: `lat` coordinate variable was attached as a bare numpy array with no attributes; the written file then had `lat` without `standard_name`/`units`/`axis`. Added CF attrs on both compute_hfbasin_tripyview and compute_sltbasin_tripyview output paths.

…iew) difmxylo: HR-FESOM (DARS ~10km) Laplacian eddy mixing values are much lower than the coarse-resolution defaults the original bounds assumed (Christian: "for higher resolution values should be lower as far as I read up on this"). cli37 cmor observed mean ~0.01 m²/s, max ~1.6 m²/s — bound expected_mean was ~1000, max ~1e4 (~5 orders too high). New bounds 0 / ~0.01 / ~5 m²/s give PASS on cli37 stats and stay narrow enough that a 100× regression would still WARN. sfdsi: Christian: "pattern looks good - not sure why the bounds fail here". cli37 cmor observed min -1.6e-4, max 6e-5 kg m-2 s-1; previous bound min/max ±1e-5 was too narrow — Arctic freezing bursts in fully-resolved sea-ice can hit ~1e-4 kg m-2 s-1 even though the global mean stays near zero. Widened to ±5e-4 (~5× the observed peak).

Recipe commit e7306e2 added scale_factor: -1.0 to hfds (FESOM `fh` is positive=up; CMIP positive=down). The bounds table still encoded the pre-flip expectation: - expected_mean=~2 (typical positive piControl drift) - bounds ±300 W/m² Post-flip, observed mean becomes negative (cli37 was +14.2 → post-fix -14.2). The classify() sign-mismatch check at sanity_check.py:337-338 auto-FAILs when expected_mean > 0 and observed < 0, so the bounds need updating regardless of magnitude. Also widening to ±500: cli37 max=1910 W/m² (post-flip min=-1910) exceeds ±300, and CMIP6 literature notes per-cell monthly peaks reach ±500 W/m² in deep-convection / strong air-sea contrast regions. Beyond ±500 stays a FAIL — sentinel for outlier cells that warrant investigation (1910 W/m² is real but extreme). expected_mean ~2 → ~0: piControl steady state has global mean near 0; small drift in either direction is fine. The em==0.0 branch of classify() (line 328) tolerates |mean|/envelope up to 10% as PASS.

Christian's cli37 review flagged hfbasin as FAIL on max=7.6 PW vs the ±2 PW bound. cli37 inspection shows the recipe is correct: annual climatology matches Trenberth almost exactly (global peak +1.96 PW @ 16°N; Atlantic +1.08 PW @ 16°N vs Trenberth ~1.3 PW). The 7.6 PW is a monthly Indo-Pacific extreme — basin-sum peak across (lat, time). Literature for monthly hfbasin extremes: - Trenberth & Caron 2001: global annual peak ~2 PW @ 35°N - RAPID 26.5°N Atlantic monthly range 0.2-2.5 PW (Johns 2011) - Pacific tropical cell NHT 1.75±0.30, SHT -1.69±0.55 PW - North Pacific 24°N seasonal cycle 0 → 1.1 PW (Bryden et al.) - HR mesoscale eddies add 30-50% over coarse models Combined: global = basin-sum at peak latitude reaches 5-8 PW monthly extreme in HR models. cli37's 7.6 PW max sits within this range. Widen bounds to ±10 PW so monthly extremes PASS while a 2× regression still WARNs and 5× regression FAILs.

- nep: -5e-6 / 2e-7 (was -5e-8 / 5e-8) — Central America wet-tropics drainage spikes are real model output (cli37 min -1.93e-6, max 1.01e-7) - mrrob: max 5e-3 (was 1e-4) — singular grid-cell spikes are real, global field fine (cli37 max 2.34e-3) - fHarvestToAtmos: ~1e-10 mean / ~1e-8 max (was ~0/~0) — 1850 LUH3 state has cropland+pasture that keeps being harvested every year - tsl: floor 150 K (was 220 K) — LPJ-GUESS shallow-layer tracks OpenIFS forcing when uninsulated; 1.5% of values <220 K in NH winter - vegHeight: rationale text fixed — LPJ-GUESS doesn't emit grass-only height, cmor falls back to tree-dominated field Reclassify on test06 JSONL: 5 FAILs cleared (2 nep, 2 fHarvestToAtmos, 1 mrrob FAIL->WARN). Three "should be zero" vars (fAnthDisturb, fNAnthDisturb, fNfert) NOT relaxed despite cli37 showing 1e-10..1e-8 values — Laszlo says these should be truly ~0 under frozen 1850 LU; spawned as D6 investigation.

…source Three new pipeline steps in examples/custom_steps.py: clip_small_negatives — sets [-thr, +thr] to 0 (default thr=1e-10). Clears tiny float-noise that propagates from raw .out files. Reviewer (Laszlo) confirmed these are not a bug to chase upstream. clip_floor_zero — floors negatives at 0. For variables whose physical floor is 0 but where the pycmor pipeline introduces negatives not present in the raw model output (mrsol, rhSoil per Laszlo). broadcast_yearly_to_monthly — repeats each yearly value across 12 mid-month time stamps. Used where the native monthly file is LAI/phenology-weighted and the yearly file is the authoritative stand-area source. Pipeline wiring: Phase B (clip_small_negatives inserted after loader in 4 pipelines): cap7_land: lpjg_monthly_pipeline veg_land: lpjg_monthly_pipeline, lpjg_monthly_lut_pipeline, mrtws_pipeline Phase C (new dedicated pipelines, 2 rules rewired): cap7_land: lpjg_monthly_clip0_pipeline <- rhSoil_mon lpjg_monthly_depth_clip0_pipeline <- mrsol_mon Both add clip_floor_zero on top of clip_small_negatives. D4a (new pipeline, treeFrac total source switch): cap7_land: lpjg_yearly_to_monthly_pipeline <- treeFrac_mon treeFrac_mon now sources treeFrac_yearly.out (was treeFrac_monthly.out). The native monthly file is LAI-weighted; the yearly file is correct stand area. variable_attributes.comment added on treeFrac_mon, rhSoil_mon, mrsol_mon documenting the derivation / clipping for downstream traceability. The per-PFT tree fractions (treeFracBdlDcd, treeFracBdlEvg, treeFracNdlDcd, treeFracNdlEvg) remain on the broken monthly source pending D4c — Laszlo signoff on the synthesis math AND/OR a possible LPJ-GUESS upstream fix he pushed today emitting yearly per-PFT files. See HANDOFF_d4_treeFrac_per_pft.md.

PLAN_veg_land_sanity_fixes.md — triages Laszlo + Christian feedback into 5 phases (A bounds, B clip-band, C sign-fix, D investigations, E verify) with explicit execution order. Round-1 review folded in. HANDOFF_d4_treeFrac_per_pft.md — investigation of treeFracNdlDcd annual-cycle bug. Documents: - root cause: LPJ-GUESS monthly per-PFT output is LAI/phenology weighted, not stand area - file inventory: treeFrac_yearly.out exists (total), per-PFT yearly files do NOT - four-option matrix (A skip / B synthesize / C max-only / D wait-for-upstream) - Option B math + per-cell normalization + divide-by-zero handling - D4b ratio sanity check on cli37: median 1.010, 88% within ±20% — math premise sound - escalation channels + comment-attribute draft REVIEW_veg_land_sanity_fixes_round1.md — external review of the plan; folded into PLAN. REVIEW_d4_treeFrac_per_pft.md — external review of the handoff; folded into HANDOFF. Three "should be zero" variables spawned as D6 from Phase A data anchor: fAnthDisturb, fNAnthDisturb, fNfert have 1e-10..1e-8 values in cli37 despite frozen 1850 LU — pending Laszlo round-3 investigation.

cli37 wo file inspection: olevel = [0, 5, 10, 20, 30, ..., 6250] — i.e. mesh.depth_bnds values (interface depths), not mesh.depth (midpoints), even though olevel:name = "nz1". FESOM 2.7's io_meandata.F90:1520 def_streams w on the top N layer interfaces (N=57 for DARS), with the surface at z=0 and the seabed BC implicit. The reviewer's "uppermost layer not too bad, those below are noisy" is the surface BC literally preserved (w=0 at interface 0) while every deeper interface carries diagnostic-w noise from integrating horizontal divergence down from the surface. The recipe was a bare passthrough that emitted these interfaces verbatim with a misleading midpoint-style coord name. New compute step ``average_w_interfaces_to_midpoints``: midpoint[i] = 0.5 * (w[i] + w[i+1]) for i = 0 … N-2 midpoint[N-1] = 0.5 * w[N-1] (bottom BC w_seabed=0) The vertical coord is replaced with mesh.depth (CMIP midpoint axis) and the dim renamed to nz1. ``compute_mass_transport`` already does the same averaging internally for ``wmo``; this factors it out for use in the new wo_pipeline. Effects: - Surface BC folded into first midpoint → no jump between layer 0 and 1; the "clean top" outlier disappears. - Output on 57 cell-centre midpoints, matching CMIP wo convention. - Vertical coord values switch from interface depths to midpoint depths.

…,NdlEvg}

…fNAnthDisturb, fNfert LUH3 1850 wood-harvest transitions (primf/secmf/secnf_harv) and a small implicit LPJ-GUESS management N baseline make these fluxes non-zero in piControl

The OIFS branch feature/cmip7-rh-online (merged into local_combined_fixes 2026-04-17) computes CMIP/CF-conformant relative humidity online every timestep using Alduchov-Eskridge Magnus with the hard water/ice switch at 273.15 K, and sends `hur_cmip7` (model levels) and `hurs_cmip7` (2m) to XIOS. Until now those sends were dropped because no XIOS field declared the IDs. field_def_cmip7.xml.j2: - declare raw `hurs_cmip7` (fraction) in 2D_physical - declare raw `hur_cmip7` (fraction) in 3D_ml - add percent alias `near_surface_relative_humidity_pct__hurs` (name="hurs") - repoint `relative_humidity_ml__hur` from `r` (FOEEWM mixed-phase QSAT, NOT CMIP-conformant) to `hur_cmip7` file_def_oifs_cmip7_spinup.xml.j2: - write `hurs` into atmos_mon (monthly surface) - write `hurs` into atmos_day (daily surface) - atmos_mon_ml already references relative_humidity_ml__hur, so it picks up CMIP7-conformant hur automatically Addresses Felix Pithan's 2026-04-15 feedback: IFS-native r/r_pl uses mixed-phase QSAT interpolation between RTICE and RTWAT (not CMIP-CF), and post-hoc RH from monthly-mean ta/hus is biased due to nonlinear e_sat(T). Computing online and averaging downstream fixes both. Pycmor-side: hurs_pipeline (post-hoc Magnus from 2t+2d) is left in place for now; can be retired once new XIOS hurs output is validated.

…ve ice pipelines Commit 5667ab2 added mask_where_no_seaice broadly to every hxy-si rule. That was the right fix for prra (Christian's explicit complaint: "rule cell_methods: where sea_ice is not applied - one sees a global field instead of a polar field") and for the ~9 atm-regridded rules (rlds/rlus/rsds/rsus_seaice, sbl_seaice, sifllattop, siflsenstop) that carry atmospheric fluxes over the entire surface and need masking to the ice domain for CMIP cell_methods compliance. For the remaining 13 FESOM-native ice pipelines, the underlying FESOM field is intrinsically zero (or NaN) outside the ice zone — the model only updates these where ice exists. Masking those is at best cosmetic (0 -> NaN) and at worst causes downstream dask graph stalls: in cli38 the lrcs_seaice tier OOM'd at MaxRSS 525 GiB on a 512 GiB cgroup; halving N_WORKERS produced an identical 525 GiB peak (memory is per-rule, not per-worker), and halving SHARD_SIZE to spread the load eliminated the OOMs but exposed a save_dataset wedge in sivol/sihc/snm/siflfwbot/etc. with no I/O progress for 45+ minutes — the dask graph for `data.where(mask)` was not being released before save_dataset, deadlocking the worker pool. Removed mask step from these pipelines: scale_mask_pipeline, sfdsi_from_fw_ice_mask_pipeline, fraction_to_percent_mask_pipeline, snm_pipeline, sisnhc_pipeline, sispeed_pipeline, sistressave_pipeline, sistressmax_pipeline, siflcondtop_pipeline, sitempbot_pipeline, sifb_pipeline, simpeffconc_pipeline, constant_field_pipeline. Kept mask in these pipelines (genuinely needed for atm-regridded fields): regrid_atm_to_fesom_seaice_mask_pipeline regrid_atm_to_fesom_seaice_mask_negate_pipeline Rules affected by the removal still inherit their previous correctness fixes from 5667ab2 (scale_factor on sisaltmass, cell-width factor on sidmasstran[xy], lat/lon attachment) — only the unnecessary mask step is gone.

Upstream release Dec 19 2025 (v1.2.2.2 was Sep 30 2025). Same 1974 variables in both, but the compound_name schema changed in three ways: - region suffixes lowercased: GLB/NH/SH -> glb/nh/sh (552 .GLB + 8 .NH + 8 .SH renames across 17 rule yamls) - ATA/GRL polar codes lowercased: ata/grl (no project rules affected) - 13 real compound_name schema renames touching this project: - atmosChem.{cfc11,cfc12,ch4,n2o}.tavg-u-hm-u -> .tavg-u-hm-air - land.tas.tavg-u-hxy-u.1hr -> .tavg-h2m-hxy-u.1hr (glb + 30S-90S) - land.tslsi.tavg-u-hxy-u.day -> .tavg-u-hxy-lsi.day - seaIce.sisnmass.tavg-u-hm-u -> .tavg-u-hm-si (day/mon x nh/sh) - atmos.ts.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day - land.esn.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day Source default in std_lib/global_attributes.py:136 updated from "GLB" to "glb" to match the new lowercase convention when get_region() falls back to the default (rule paths already supply region explicitly via compound_name). Validation: all 17 rule yamls construct CMORizer cleanly against the new metadata.json; 548/552 compound_names resolve in v1.2.2.3 (the 4 unmatched are Laszlo's per-PFT yearly diagnostic rules, extra-CMIP). Metadata cache populated at: ~/.cache/pycmor/cmip7_metadata/v1.2.2.3/metadata.json (kept v1.2.2.2) Substantive content edits in v1.2.2.3 (for variables we use): - atmos.{hurs,rsds,sfcWind}.tavg-...-1hr.30S-90S: cleaner CF-standard long_name values (long_name was internal text in v1.2.2.2).

cli39 veg_land shards 2/3/4 failed with ValueError: Rule 'treeFracBdlEvg_yr' with compound_name='land.treeFracBdlEvg.tavg-u-hxy-u.yr.GLB' did not match any variables in the CMIP7 data request CMIP7 only has the per-PFT treeFrac variants at MONTHLY cadence: Emon.treeFracBdlDcd / Emon.treeFracBdlEvg / Emon.treeFracNdlDcd / Emon.treeFracNdlEvg No yearly Eyr.treeFrac{BdlDcd,BdlEvg,NdlDcd,NdlEvg} entries exist. The monthly counterparts are already wired in cap7_land/cmip7_awiesm3-veg-hr_cap7_land.yaml and cover the CMIP7 requirement. Removes: - treeFracBdlDcd_yr - treeFracBdlEvg_yr - treeFracNdlDcd_yr - treeFracNdlEvg_yr Keeps the aggregate treeFrac_yr (maps to valid Eyr.treeFrac).

cli39 extra_land (job 25003750_1) hit 3h walltime with 4 save_dataset ops stuck at heartbeat #16: ⟳ save_dataset[tas] still running (t=960s, #16) ⟳ save_dataset[mrsow] still running (t=960s, #16) ⟳ save_dataset[dslw] still running (t=960s, #16) ⟳ save_dataset[orog] still running (t=960s, #16) All four are long-running surface saves (1-hourly tas SH-subset, daily mrsow/dslw via swvl1..swvl4 + temporal_diff, fx orog SH-subset). With extra_land's 18 rules all in a single shard at SHARD_SIZE=20, these four kicked off parallel saves that contended on the synchronous netcdf write scheduler. 9 saves completed cleanly, then the 4 wedged. Lower SHARD_SIZE specifically to 5 for extra_land — splits 18 rules into ~4 shards, distributing the contended saves across them. Each shard's individual save peak stays small enough to finish inside 3h. Other tiers keep the global SHARD_SIZE=20.

Both tiers hit TIMEOUT in cli40 due to multiple save_dataset operations contending on the global HDF5 write lock with no progress at heartbeat #87+ over 3h. Sequence in lrcs_seaice_3: siarea 176×, siextent 89×, sisnmass 87×, sfdsi 87×, sihc 86×, sidmasstrany 86× veg_land_1 wedged on vegHeight/dgw/tslsi/esn at heartbeat #21. Wall-vs-Σsave analysis (cli40 successful shards) shows these two tiers already run nearly serially under lock contention: lrcs_seaice_4: ratio 1.18× (12% wall-time penalty if forced serial) lrcs_seaice_2: 1.8× Forcing strict serial eliminates the wedge with negligible cost. 3D-heavy tiers (cap7_atm 4.1×, core_ocean 3.7×, extra_atm 1.8×) genuinely need parallel saves to stay inside 3h walltime, so they are NOT modified. Mechanism: every pipeline in the two yamls now declares throttle_group: <tier>_serial and the pycmor: section adds throttle_caps: <tier>_serial: 1 which caps in-flight rules of that group at 1 per batch. For lrcs_seaice this also subsumes the previous oifs_regrid cap=2 (which was set for driver-RSS, not the lock); cap=1 trivially satisfies it.

cli41 logs showed "Throttle caps (per-group rule submission limit): none" in both lrcs_seaice and veg_land — the yaml-level `throttle_caps` block I added in 5a61ad3 got silently dropped by the Everett-based PycmorConfigManager (only Options declared in PycmorConfig are preserved through schema normalisation; unknown keys vanish). Result: default per-group cap of 2 applied, lrcs_seaice still wedged on 9+ parallel save_dataset operations: cli41 _1: TIMEOUT (heartbeat #16, 9 saves stuck) cli41 _3: TIMEOUT (same pattern) The cmorizer `_resolve_throttle_caps()` already reads PYCMOR_THROTTLE_CAPS from the environment as a higher-priority source than the yaml. Move the per-tier cap declaration to the launcher. submit_hr_year_shards.sh: - New per-tier `tier_throttle_caps` case: lrcs_seaice → "lrcs_seaice_serial:1" veg_land → "veg_land_serial:1" (others) → "" - Threaded into `--export=...,PYCMOR_THROTTLE_CAPS=$tier_throttle_caps` Yaml cleanup: - Removed the dead `throttle_caps:` block from both yamls - Kept the explanatory comment so the next reader knows where the cap actually lives (the launcher) Pipeline-level `throttle_group:` annotations from 5a61ad3 are unchanged and are still read by Everett (declared on the Pipeline schema).

cli43 lrcs_seaice_3 TIMEOUT despite cap=1 being correctly applied: Throttle caps: {'lrcs_seaice_serial': 1} ... but 5 saves concurrent and stuck at heartbeat #101+ Root cause: `_rule_throttle_group()` only walked `rule.pipelines[*]`. 20 of 64 lrcs_seaice rules (siarea_north, siextent_north_day, siflcondbot, sidconcth, sitimefrac, ...) have no `pipelines:` key — they use pycmor's default pipeline, which carries no throttle_group, so they landed in `_unthrottled` and ran in parallel through the HDF5 global write lock. Two-part fix: 1. cmorizer.py: read rule.throttle_group first, fall back to pipeline annotation. Rule-level wins (per-rule override), preserves existing pipeline-level behaviour. 2. lrcs_seaice + veg_land yamls: add `throttle_group: <tier>_serial` to the `inherit:` block. With validate.py:299 already accepting `throttle_group` on the rule schema, inherit propagates it to every rule — pipelined or not. Effect on cli44+: all 64 lrcs_seaice rules (and all veg_land rules) join their tier's serial group; PYCMOR_THROTTLE_CAPS env var caps it at 1; the multi-rule HDF5-lock wedge can no longer form.

JanStreffing requested review from christian-stepanek, mandresm, nwieters, pgierz and siligam April 2, 2026 12:55

JanStreffing changed the base branch from main to prep-release April 2, 2026 12:55

JanStreffing force-pushed the feat/cmip7-awiesm3-veg-hr branch from 1a42875 to 5617a18 Compare April 2, 2026 13:18

esm-tools deleted a comment from github-actions Bot Apr 6, 2026

esm-tools deleted a comment from github-actions Bot Apr 8, 2026

nwieters and others added 6 commits April 9, 2026 15:04

Removed duplicate rules for cap7_ocean.

148cc59

Placed rule and pipeline for msftbarot in lrcs_ocen

9fa7ba5

JanStreffing added 7 commits May 14, 2026 23:37

JanStreffing force-pushed the feat/cmip7-awiesm3-veg-hr branch from 09fe51e to 900c575 Compare May 18, 2026 14:44

JanStreffing and others added 22 commits May 18, 2026 16:52

cap7_land: add vegHeightGrass_mon rule (hxy-grs)

b706941

cap7_land: add yearly rules for per-PFT treeFrac{BdlDcd,BdlEvg,NdlDcd…

470641b

…,NdlEvg}

sanity_check_ranges: loosen cli37 piControl bounds for fAnthDisturb, …

465afa4

…fNAnthDisturb, fNfert LUH3 1850 wood-harvest transitions (primf/secmf/secnf_harv) and a small implicit LPJ-GUESS management N baseline make these fluxes non-zero in piControl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/cmip7 awiesm3 veg hr#266

Feat/cmip7 awiesm3 veg hr#266
JanStreffing wants to merge 303 commits into
prep-releasefrom
feat/cmip7-awiesm3-veg-hr

JanStreffing commented Apr 2, 2026 •

edited

Loading

Uh oh!

JanStreffing commented Apr 2, 2026

Uh oh!

JanStreffing commented Apr 5, 2026 •

edited

Loading

Uh oh!

mandresm commented Apr 7, 2026

Uh oh!

pgierz commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

JanStreffing commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CMIP7 cmorization for AWI-ESM3-VEG-HR

Key changes

Test plan

Uh oh!

JanStreffing commented Apr 2, 2026

Uh oh!

JanStreffing commented Apr 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mandresm commented Apr 7, 2026

Uh oh!

pgierz commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

JanStreffing commented Apr 2, 2026 •

edited

Loading

JanStreffing commented Apr 5, 2026 •

edited

Loading