Feat/cmip7 awiesm3 veg hr#266
Open
JanStreffing wants to merge 303 commits into
Open
Conversation
1a42875 to
5617a18
Compare
Contributor
Author
Contributor
Author
|
The single failing test ( Something to worry about? @pgierz |
Contributor
|
@pgierz, will you review this or should I go for it? |
Member
|
I will look |
calculate_chunks_simple() could produce chunk sizes larger than the actual dimension when the dataset was very small (e.g. fx variables with only 47 or 57 elements). This caused netCDF4 createVariable to raise "chunksize cannot exceed dimension size". Add min() clamp.
When xarray reads NetCDF files, it stores 'coordinates' in the variable encoding dict. set_coordinate_attributes then sets it in attrs too, causing xarray to raise "'coordinates' found in both attrs and encoding" on save. Pop from encoding before setting in attrs.
CMIP7 salinity units are "1E-03" (dimensionless with scaling factor). pint-xarray cannot convert to such targets, raising "Unit expression cannot have a scaling factor". When both source (e.g. psu) and target are dimensionless, relabel without numeric conversion since the values are already in the correct range.
All ~33 custom step functions hardcoded the CMOR variable name (e.g. result.name = "zostoga") instead of using rule.model_variable. This caused KeyError in set_variable when model_variable differs from the CMOR name (e.g. model_variable="temp" but name was "zostoga"). Also fix compute_mass_transport to handle W-level data (nz=48 interfaces) by averaging to cell centers (nz=47) before multiplying by layer thickness.
cli3X veg_land OSError root cause: every load_lpjguess_* function did
``for _, row in df_all.iterrows():`` to scatter rows into a pre-allocated
(time, ncells) numpy array. At HR LPJ-GUESS resolution that's millions
of Python iterations holding the GIL for several minutes per rule.
A dask worker thread that holds the GIL can't service its own
LocalCluster scheduler's heartbeat. After
distributed.comm.timeouts.connect (30s default) the scheduler marks
the worker dead and disconnects; the worker comes back when iterrows
finishes but the cluster has already cascaded into:
OSError: Timed out trying to connect to tcp://127.0.0.1:<port> after 30 s
— visible in every cli3X veg_land shard log, killing all 3 shards.
The whole-rule retry decorator (8815440) catches the OSError but the
LocalCluster is too damaged to recover within the retry window.
Vectorized pattern (same logic, no Python row loop):
coords_df_with_idx = coords_df.copy()
coords_df_with_idx["_cell_idx"] = np.arange(len(coords_df_with_idx))
df_merged = df_all.merge(
coords_df_with_idx[["Lon", "Lat", "_cell_idx"]],
on=["Lon", "Lat"], how="left",
)
valid = ~np.isnan(df_merged["_cell_idx"].values)
cell_idx_int = df_merged["_cell_idx"].values[valid].astype(np.int64)
yr_idx = np.searchsorted(years, df_merged["Year"].values[valid])
values[yr_idx, cell_idx_int] = df_merged[var].values[valid]
Pandas merge runs in C and releases the GIL during the join. Load
should drop from "minutes blocking GIL" to "seconds".
Applied to all six loaders that used the iterrows pattern:
load_lpjguess_monthly (yearly + monthly format branches)
load_lpjguess_yearly
load_lpjguess_yearly_lut
load_lpjguess_monthly_lut
load_lpjguess_monthly_depth (3D: time × sdepth × ncells)
load_lpjguess_monthly_pool (3D: time × soilCpool × ncells)
The 3D loaders still iterate the small depth/pool dim (3-6 layers) in
Python — that's bounded and fast. The hot loop (Python over all rows)
is gone in all six.
extra_atm sacct history (May 1-15): every reliable completion used
512G memory.
Whole-tier era (pre-shard):
24733123 mem512 01:51:28
24733545 01:11:17
24743470 cli3 02:07:45
24748551 cli5 02:34:58
24784803 cli9 01:22:15
24813316 cli17 01:10:25
Sharded --mem=0 era (default ~256G):
cli28-cli36: OOM × 6, TIMEOUT × 1, two completions at non-default
241016M (~235G).
6 completions on 512G + 0 reliable completions on 256G — proven config.
cli36 forensics confirmed the failure mode is not fix3-wedge or
allocator (jemalloc on, fix3=off, 17/20 rules saved): MaxRSS=167G but
MaxVMSize=458G shows malloc fragmentation pushing past the cgroup,
plus heavy concurrent hourly OIFS + 3D rules ('cl', 'pfull' both at
55min when 3 new rules started) tipping it over.
Adds 1 task to the scarcer ~282-node 512G pool; total 512G footprint
becomes 5/35 tasks (14% of work). Pool pressure remains minor.
Alternative considered: SHARD_SIZE=10 to halve concurrent-heavy count.
Architecturally cleaner but unproven for this tier — chose the path
with 6 historical successes.
…/_infer_frequency_multifile Three Python-loop hotspots replaced with pandas vectorized paths: 1. ``select_range`` and ``validate_range``: ``df["start"].apply(pd.Timestamp)`` → ``pd.to_datetime(df["start"])``. Same conversion, but C-level under the hood. These run on every Filecache load that filters by time range. 2. ``_infer_frequency_multifile``: ``for _, row in df.iterrows()`` building the timestamp list → ``pd.to_datetime(..., errors="coerce")`` + boolean masks. The try/except per-row pattern becomes a single ``valid`` mask; the steps-1/2/>2 branching reduces to a ``steps == 2`` mask appending end timestamps in addition to starts. Net: no behaviour change, no memory change (same Series being materialized, just via vectorized parse path). All 17 filecache unit tests pass. Same diagnostic motivation as the lpjg vectorization (7900daa): Python iteration over pandas/numpy in dask-managed code holds the GIL longer than necessary and starves the LocalCluster heartbeat. filecache loads happen at flow startup, so the impact here is on cold-start time, not mid-rule deadlocks — but the antipattern is the same. Not touched: ``regrid_regular_to_fesom`` in examples/custom_steps.py (also has a per-timestep loop) — not referenced by any production cli3X rule yaml, only an old test config. Dead code for the production path.
lrcs_seaice was given 6h walltime in 0c4aff1 because cli30 lrcs_seaice_3 TIMEOUTed at 3h. That was before jemalloc (cli34/5/6/7 made the allocator opt-in then global). The previous slow path was driven by glibc fragmentation, not real compute time. Empirical wall times since jemalloc landed: cli35 lrcs_seaice _1/2/3/4: 3:17, 1:08:52, 1:11:16, 46:13 cli36 lrcs_seaice _1/2/3/4: 3:07, 1:12:04, 1:10:59, 56:15 slowest single shard: 1:12:04 — well under 3h 3h gives 1h48m headroom on the worst observed run, which is enough. All tiers now use the global WALLTIME default (3h) — uniform, simpler.
New env knob ``SHARD_DRS`` (default ``off``) toggles pycmor's
``enable_output_subdirs`` config option. When on:
* The injected per-shard yaml sets
``pycmor.enable_output_subdirs = True``, so files.py's filename
builder appends the DRS sub-tree returned by GlobalAttributes
.subdir_path():
<mip_era>/<activity_id>/<institution_id>/<source_id>/
<experiment_id>/<member_id>/<table_id>/<variable_id>/
<grid_label>/v<YYYYMMDD>/
e.g. ``CMIP7/CMIP/AWI/AWI-ESM-3/picontrol/r1i1p1f1/Amon/tas/gn/
v20260515/tas_<...>.nc``
* The submitter collapses the per-tier ``<tier>/cmorized/`` OUTSUB
to ``.`` so all 17 tiers land in one shared DRS root. The non-DRS
nesting was a pycmor-specific organizational layer that downstream
consumers don't expect at the start of the path.
Off by default — keeps the legacy per-tier flat layout for any current
consumers. Turn on via ``SHARD_DRS=on bash submit_hr_year_shards.sh ...``
for production / publication-ready output.
…zed layout build_html_report: render_file_card was showing variable-group bounds (whichever cadence happened to merge first) for every per-file card. 21 variables with multiple cadence rows displayed wrong "Expected" columns and range SVG. Prefer rec.expected_* over ent.expected_*. sanity_check: walker glob() missed files under <root>/<tier>/cmorized/ (produced when the runner emits a DRS-shaped subtree). Switch to rglob() so the walker handles both flat and nested layouts. run_walker_compute.sh: new sbatch wrapper to walk on compute nodes; login-node OpenBLAS pthread limit + Lustre contention crashes the worker pool on the big 8-19 GB 1hr/3D-model-level files. run_perfile_maps_compute.sh: paths bumped to cli37 run.
build_maps: collapse all non-spatial dims (time + level + tile) by the
matching reducer per panel (min/min, mean/mean, max/max) so 3D fields
get full spatial coverage instead of mostly-NaN deep-level slices.
Other fixes: rglob() for nested cmorized/ layout, sibling-file lat/lon
recovery for hxy-si nod2 files that strip coords, dask-chunked open so
84-GB cl_day/pfull_day render without OOM, 2-98 percentile colorbar
clipping, prefer lat/lon over latitude/longitude for spatial-dim
discovery on 3D atm files.
build_html_report: per-file card now uses rec-level expected bounds
(was showing variable-group bounds, wrong for 21 cadence-split vars
incl. hfls, hfss, pr, prsn, rsds, rsus, uas, vas). NH/SH suffix on
hemispheric scalar files (siarea, siextent, sisnmass, sivol) with
North->South word substitution. FAIL diagnoses now append a
"mean-in-bounds, outlier-extremes" hint when the field mean sits
inside the expected window (e.g. hfls_mon FAIL is single-cell, not
field-wide).
sanity_check_ranges: recalibrate bounds with reviewer feedback —
phcint allow negatives (T<0degC at high lat), masscello to AWI-ESM
vertical discretization (5125..358750 kg/m^2), prra and siconc to
ice-zone/global walker semantics, cl to per-layer (~5%) not
column-integrated (~30%), umo/vmo/sfx/sfy reflect compute fix below.
custom_steps.compute_mass_transport, compute_salt_transport: integrate
across the FESOM cell edge by multiplying by sqrt(cell_area). Was
emitting kg/(s*m) but the CMIP7 spec for Omon.{umo,vmo,sfx,sfy}
requires kg/s. New _fesom_edge_width helper resolves the horizontal
dim against (nod2, ncells, ncol) and falls back to a clear error if
the mesh lacks cell_area. Order-of-magnitude check: typical FESOM HR
mid-lat cell now ~4.5e7 kg/s, strong-current cell ~1.8e9 kg/s.
cli37 walker output: 743 files, PASS 394 / WARN 251 / FAIL 96 /
ERROR 2.
09fe51e to
900c575
Compare
… lat/lon Implements the cmor/CMIP7 corrections from the cli37 sea-ice review. hxy-si rules now apply mask_where_no_seaice so the output matches CMIP7 cell_methods "area: time: mean where sea_ice (mask=siconc)". Previously prra/prsn (and all other rules using scale_pipeline / snm_pipeline / sispeed_pipeline / sistressave_pipeline / sistressmax_pipeline / siflcondtop_pipeline / sisnhc_pipeline / sitempbot_pipeline / sifb_pipeline / simpeffconc_pipeline / constant_field_pipeline / sisnhc_from_msnow_pipeline / fraction_to_percent_pipeline / sfdsi_from_fw_ice_pipeline) wrote the field over the full FESOM domain including the tropics. New scale_mask_pipeline / fraction_to_percent_mask_pipeline / sfdsi_from_fw_ice_mask_pipeline cover the cases where the original pipeline is shared with a non-hxy-si caller; the rest got mask_where_no_seaice added in-place since every consumer is hxy-si. The 17 direct-passthrough hxy-si rules (no pipelines: block) remain unmasked — the audit directive was scoped to "every hxy-si rule pipeline". sisaltmass: scale_factor 0.004 -> 3.64. FESOM's m_ice is effective ice height per cell area (units 'm', long_name "ice height per unit area" in io_meandata.F90), not kg/m² as the prior rule comment assumed. The correct salt-mass conversion is (sice/1000) * m_ice * rho_ice = 0.004 * m_ice * 910 = 3.64 * m_ice. sice=4.0 verified from Final_CMIP7_IO_Test_06/config/fesom/namelist.ice; rho_ice=910 (AOMIP) from fesom-2.7 MOD_ICE.F90:61. sidmasstran[xy]: compute_ice_mass_transport now does uice * m_ice * rho_ice * sqrt(cell_area) and emits kg/s. Reuses the existing _fesom_edge_width helper (added by 900c575 for umo/vmo/sfx/sfy). Previously the formula was uice * m_ice mislabelled "kg/s" — actually m²/s, ~5 orders of magnitude too low. regrid_oifs_to_fesom: assign_coords lat/lon on nod2 after the isel. Source-grid lat/lon were attached to the now-removed source_dim; without re-attaching the FESOM lat/lon, the output had only nod2 as a dim and external tools (ushow, Panoply, ncview) couldn't render the field. The sanity-check map renderer already worked around this with _find_sibling_latlon; that workaround stays as a defence-in-depth fallback. sisnthick: not added — CMIP7 v1.2.2.2 has no sisnthick out_name. The CMIP7 equivalent in seaIce is snd (standard_name surface_snow_thickness, units m), already produced by core_seaice/snd and cap7_seaice/snd_day.
…ealm The walker recorded ``realm`` from the bounds-table lookup, which is keyed by out_name only. For out_names that exist in multiple CMIP realms (rlds in atmos+seaIce, prra, prsn, evspsbl, rsds, rsus, rlus, ts, ...) every branded variant got stamped with whichever realm the bounds table happened to store — typically the atm realm. The ``-hxy-si`` files then landed on ``atm.html`` even though their own ``:realm`` global attribute correctly said ``seaIce``. worker_main now reads ``nc.realm`` and prefers it over the table value for the emitted JSONL record. The file's own CMIP global attrs are authoritative; the bounds-table realm is the fallback for ERROR or NOBOUNDS records that never got a realm written. collapse() in build_html_report.py now keys VarEntry by (var, realm) instead of var alone, so a variable that appears under two realms splits into two cards routed to the right HTML page. Previously ``ent.realm`` was "first record wins" — bundling all branded variants under whichever record happened to come first in the JSONL. Reported by Nadine on the cli37 report (rlds-si, prsn-si, ts-si etc. showing up in atm.html with global data).
Each disabled rule emitted the same compound_name as an existing core or cap7_atm rule with divergent content. Same metadata in both files (long_name, standard_name, cell_methods, realm, units) so downstream tools couldn't tell the LPJ-GUESS value from the OIFS one. CMIP7 ships the OIFS coupled value as the canonical evspsbl/mrro/mrros/mrso/ mrsol/snc/snd/snw. Re-enable under distinct CMIP7 names (e.g. evspsblveg for canopy, tran for transpiration) if LPJ-specific land-hydrology values are needed for vegetation analysis later. core_land/areacella: kept enabled (duplicates core_atm but content is byte-identical and each tier needs its own copy for cell_measures resolution in stand-alone tier delivery — added comment noting this is intentional). Audit: grep -h '^\s*compound_name:' awi-esm3-veg-hr-variables/*/cmip7_*.yaml | grep -v '^\s*#' | sort | uniq -c | awk '$1>1' -> 1 line (atmos.areacella, intentional)
…sin/sltbasin lat attrs Christian's cli37 review flagged four ocean-side recipe issues that fall on the pycmor side. All addressed here; the remaining items (wfo / obvfsq / wo / volcello / opottempdiff bounds) are either model-team questions or sanity-bound tuning, not recipe bugs. hfds: FESOM `fh` is positive=UP (ice_oce_coupling.F90:418 writes ``heat_flux_in = heat_flux = -net_heat_flux``; density-flux equation at line 654 confirms positive `heat_flux_in` corresponds to ocean cooling/densification). CMIP7 `hfds` is positive=DOWN, so the rule now applies `scale_factor: -1.0` via scale_pipeline. Previously hfds was a bare passthrough, giving cli37 reviewers a North Atlantic that appears to GAIN heat in January. msftbarot: f_min default 1e-5 → 2.5e-5 in compute_msftbarot. The geostrophic SSH approximation `psi = rho_0 * g * H / f * eta` breaks down within ~10° of the equator, and at f=1e-5 only |lat| < ~4° gets masked — leaving a band ±4-10° where f is small but finite, producing the "weird artifact near the equator" Christian flagged. f_min=2.5e-5 extends the NaN mask to ±~10°. compute_mass_transport: vertical component (transport_component='z', used by wmo) now multiplies by `cell_area` instead of `dz * sqrt(cell_area)`. The horizontal-edge formula (dz * sqrt(cell_area), correct for umo/vmo) under-counted vertical flux by ~dz/sqrt(cell_area) ≈ 50/1e4 ≈ 200× at FESOM HR resolution. hfbasin / sltbasin: `lat` coordinate variable was attached as a bare numpy array with no attributes; the written file then had `lat` without `standard_name`/`units`/`axis`. Added CF attrs on both compute_hfbasin_tripyview and compute_sltbasin_tripyview output paths.
…iew) difmxylo: HR-FESOM (DARS ~10km) Laplacian eddy mixing values are much lower than the coarse-resolution defaults the original bounds assumed (Christian: "for higher resolution values should be lower as far as I read up on this"). cli37 cmor observed mean ~0.01 m²/s, max ~1.6 m²/s — bound expected_mean was ~1000, max ~1e4 (~5 orders too high). New bounds 0 / ~0.01 / ~5 m²/s give PASS on cli37 stats and stay narrow enough that a 100× regression would still WARN. sfdsi: Christian: "pattern looks good - not sure why the bounds fail here". cli37 cmor observed min -1.6e-4, max 6e-5 kg m-2 s-1; previous bound min/max ±1e-5 was too narrow — Arctic freezing bursts in fully-resolved sea-ice can hit ~1e-4 kg m-2 s-1 even though the global mean stays near zero. Widened to ±5e-4 (~5× the observed peak).
Recipe commit e7306e2 added scale_factor: -1.0 to hfds (FESOM `fh` is positive=up; CMIP positive=down). The bounds table still encoded the pre-flip expectation: - expected_mean=~2 (typical positive piControl drift) - bounds ±300 W/m² Post-flip, observed mean becomes negative (cli37 was +14.2 → post-fix -14.2). The classify() sign-mismatch check at sanity_check.py:337-338 auto-FAILs when expected_mean > 0 and observed < 0, so the bounds need updating regardless of magnitude. Also widening to ±500: cli37 max=1910 W/m² (post-flip min=-1910) exceeds ±300, and CMIP6 literature notes per-cell monthly peaks reach ±500 W/m² in deep-convection / strong air-sea contrast regions. Beyond ±500 stays a FAIL — sentinel for outlier cells that warrant investigation (1910 W/m² is real but extreme). expected_mean ~2 → ~0: piControl steady state has global mean near 0; small drift in either direction is fine. The em==0.0 branch of classify() (line 328) tolerates |mean|/envelope up to 10% as PASS.
Christian's cli37 review flagged hfbasin as FAIL on max=7.6 PW vs the ±2 PW bound. cli37 inspection shows the recipe is correct: annual climatology matches Trenberth almost exactly (global peak +1.96 PW @ 16°N; Atlantic +1.08 PW @ 16°N vs Trenberth ~1.3 PW). The 7.6 PW is a monthly Indo-Pacific extreme — basin-sum peak across (lat, time). Literature for monthly hfbasin extremes: - Trenberth & Caron 2001: global annual peak ~2 PW @ 35°N - RAPID 26.5°N Atlantic monthly range 0.2-2.5 PW (Johns 2011) - Pacific tropical cell NHT 1.75±0.30, SHT -1.69±0.55 PW - North Pacific 24°N seasonal cycle 0 → 1.1 PW (Bryden et al.) - HR mesoscale eddies add 30-50% over coarse models Combined: global = basin-sum at peak latitude reaches 5-8 PW monthly extreme in HR models. cli37's 7.6 PW max sits within this range. Widen bounds to ±10 PW so monthly extremes PASS while a 2× regression still WARNs and 5× regression FAILs.
- nep: -5e-6 / 2e-7 (was -5e-8 / 5e-8) — Central America wet-tropics drainage spikes are real model output (cli37 min -1.93e-6, max 1.01e-7) - mrrob: max 5e-3 (was 1e-4) — singular grid-cell spikes are real, global field fine (cli37 max 2.34e-3) - fHarvestToAtmos: ~1e-10 mean / ~1e-8 max (was ~0/~0) — 1850 LUH3 state has cropland+pasture that keeps being harvested every year - tsl: floor 150 K (was 220 K) — LPJ-GUESS shallow-layer tracks OpenIFS forcing when uninsulated; 1.5% of values <220 K in NH winter - vegHeight: rationale text fixed — LPJ-GUESS doesn't emit grass-only height, cmor falls back to tree-dominated field Reclassify on test06 JSONL: 5 FAILs cleared (2 nep, 2 fHarvestToAtmos, 1 mrrob FAIL->WARN). Three "should be zero" vars (fAnthDisturb, fNAnthDisturb, fNfert) NOT relaxed despite cli37 showing 1e-10..1e-8 values — Laszlo says these should be truly ~0 under frozen 1850 LU; spawned as D6 investigation.
…source
Three new pipeline steps in examples/custom_steps.py:
clip_small_negatives — sets [-thr, +thr] to 0 (default thr=1e-10).
Clears tiny float-noise that propagates from
raw .out files. Reviewer (Laszlo) confirmed
these are not a bug to chase upstream.
clip_floor_zero — floors negatives at 0. For variables whose
physical floor is 0 but where the pycmor
pipeline introduces negatives not present in
the raw model output (mrsol, rhSoil per
Laszlo).
broadcast_yearly_to_monthly
— repeats each yearly value across 12 mid-month
time stamps. Used where the native monthly
file is LAI/phenology-weighted and the yearly
file is the authoritative stand-area source.
Pipeline wiring:
Phase B (clip_small_negatives inserted after loader in 4 pipelines):
cap7_land: lpjg_monthly_pipeline
veg_land: lpjg_monthly_pipeline, lpjg_monthly_lut_pipeline,
mrtws_pipeline
Phase C (new dedicated pipelines, 2 rules rewired):
cap7_land: lpjg_monthly_clip0_pipeline <- rhSoil_mon
lpjg_monthly_depth_clip0_pipeline <- mrsol_mon
Both add clip_floor_zero on top of clip_small_negatives.
D4a (new pipeline, treeFrac total source switch):
cap7_land: lpjg_yearly_to_monthly_pipeline <- treeFrac_mon
treeFrac_mon now sources treeFrac_yearly.out (was
treeFrac_monthly.out). The native monthly file is
LAI-weighted; the yearly file is correct stand area.
variable_attributes.comment added on treeFrac_mon, rhSoil_mon, mrsol_mon
documenting the derivation / clipping for downstream traceability.
The per-PFT tree fractions (treeFracBdlDcd, treeFracBdlEvg,
treeFracNdlDcd, treeFracNdlEvg) remain on the broken monthly source
pending D4c — Laszlo signoff on the synthesis math AND/OR a possible
LPJ-GUESS upstream fix he pushed today emitting yearly per-PFT files.
See HANDOFF_d4_treeFrac_per_pft.md.
PLAN_veg_land_sanity_fixes.md — triages Laszlo + Christian feedback
into 5 phases (A bounds, B clip-band, C sign-fix, D investigations,
E verify) with explicit execution order. Round-1 review folded in.
HANDOFF_d4_treeFrac_per_pft.md — investigation of treeFracNdlDcd
annual-cycle bug. Documents:
- root cause: LPJ-GUESS monthly per-PFT output is LAI/phenology
weighted, not stand area
- file inventory: treeFrac_yearly.out exists (total), per-PFT
yearly files do NOT
- four-option matrix (A skip / B synthesize / C max-only /
D wait-for-upstream)
- Option B math + per-cell normalization + divide-by-zero handling
- D4b ratio sanity check on cli37: median 1.010, 88% within ±20%
— math premise sound
- escalation channels + comment-attribute draft
REVIEW_veg_land_sanity_fixes_round1.md — external review of the plan;
folded into PLAN.
REVIEW_d4_treeFrac_per_pft.md — external review of the handoff;
folded into HANDOFF.
Three "should be zero" variables spawned as D6 from Phase A data
anchor: fAnthDisturb, fNAnthDisturb, fNfert have 1e-10..1e-8 values
in cli37 despite frozen 1850 LU — pending Laszlo round-3 investigation.
cli37 wo file inspection: olevel = [0, 5, 10, 20, 30, ..., 6250] —
i.e. mesh.depth_bnds values (interface depths), not mesh.depth
(midpoints), even though olevel:name = "nz1". FESOM 2.7's
io_meandata.F90:1520 def_streams w on the top N layer interfaces
(N=57 for DARS), with the surface at z=0 and the seabed BC implicit.
The reviewer's "uppermost layer not too bad, those below are noisy"
is the surface BC literally preserved (w=0 at interface 0) while every
deeper interface carries diagnostic-w noise from integrating
horizontal divergence down from the surface. The recipe was a bare
passthrough that emitted these interfaces verbatim with a misleading
midpoint-style coord name.
New compute step ``average_w_interfaces_to_midpoints``:
midpoint[i] = 0.5 * (w[i] + w[i+1]) for i = 0 … N-2
midpoint[N-1] = 0.5 * w[N-1] (bottom BC w_seabed=0)
The vertical coord is replaced with mesh.depth (CMIP midpoint axis)
and the dim renamed to nz1. ``compute_mass_transport`` already does
the same averaging internally for ``wmo``; this factors it out for
use in the new wo_pipeline.
Effects:
- Surface BC folded into first midpoint → no jump between layer 0
and 1; the "clean top" outlier disappears.
- Output on 57 cell-centre midpoints, matching CMIP wo convention.
- Vertical coord values switch from interface depths to midpoint
depths.
…fNAnthDisturb, fNfert LUH3 1850 wood-harvest transitions (primf/secmf/secnf_harv) and a small implicit LPJ-GUESS management N baseline make these fluxes non-zero in piControl
The OIFS branch feature/cmip7-rh-online (merged into local_combined_fixes 2026-04-17) computes CMIP/CF-conformant relative humidity online every timestep using Alduchov-Eskridge Magnus with the hard water/ice switch at 273.15 K, and sends `hur_cmip7` (model levels) and `hurs_cmip7` (2m) to XIOS. Until now those sends were dropped because no XIOS field declared the IDs. field_def_cmip7.xml.j2: - declare raw `hurs_cmip7` (fraction) in 2D_physical - declare raw `hur_cmip7` (fraction) in 3D_ml - add percent alias `near_surface_relative_humidity_pct__hurs` (name="hurs") - repoint `relative_humidity_ml__hur` from `r` (FOEEWM mixed-phase QSAT, NOT CMIP-conformant) to `hur_cmip7` file_def_oifs_cmip7_spinup.xml.j2: - write `hurs` into atmos_mon (monthly surface) - write `hurs` into atmos_day (daily surface) - atmos_mon_ml already references relative_humidity_ml__hur, so it picks up CMIP7-conformant hur automatically Addresses Felix Pithan's 2026-04-15 feedback: IFS-native r/r_pl uses mixed-phase QSAT interpolation between RTICE and RTWAT (not CMIP-CF), and post-hoc RH from monthly-mean ta/hus is biased due to nonlinear e_sat(T). Computing online and averaging downstream fixes both. Pycmor-side: hurs_pipeline (post-hoc Magnus from 2t+2d) is left in place for now; can be retired once new XIOS hurs output is validated.
…ve ice pipelines Commit 5667ab2 added mask_where_no_seaice broadly to every hxy-si rule. That was the right fix for prra (Christian's explicit complaint: "rule cell_methods: where sea_ice is not applied - one sees a global field instead of a polar field") and for the ~9 atm-regridded rules (rlds/rlus/rsds/rsus_seaice, sbl_seaice, sifllattop, siflsenstop) that carry atmospheric fluxes over the entire surface and need masking to the ice domain for CMIP cell_methods compliance. For the remaining 13 FESOM-native ice pipelines, the underlying FESOM field is intrinsically zero (or NaN) outside the ice zone — the model only updates these where ice exists. Masking those is at best cosmetic (0 -> NaN) and at worst causes downstream dask graph stalls: in cli38 the lrcs_seaice tier OOM'd at MaxRSS 525 GiB on a 512 GiB cgroup; halving N_WORKERS produced an identical 525 GiB peak (memory is per-rule, not per-worker), and halving SHARD_SIZE to spread the load eliminated the OOMs but exposed a save_dataset wedge in sivol/sihc/snm/siflfwbot/etc. with no I/O progress for 45+ minutes — the dask graph for `data.where(mask)` was not being released before save_dataset, deadlocking the worker pool. Removed mask step from these pipelines: scale_mask_pipeline, sfdsi_from_fw_ice_mask_pipeline, fraction_to_percent_mask_pipeline, snm_pipeline, sisnhc_pipeline, sispeed_pipeline, sistressave_pipeline, sistressmax_pipeline, siflcondtop_pipeline, sitempbot_pipeline, sifb_pipeline, simpeffconc_pipeline, constant_field_pipeline. Kept mask in these pipelines (genuinely needed for atm-regridded fields): regrid_atm_to_fesom_seaice_mask_pipeline regrid_atm_to_fesom_seaice_mask_negate_pipeline Rules affected by the removal still inherit their previous correctness fixes from 5667ab2 (scale_factor on sisaltmass, cell-width factor on sidmasstran[xy], lat/lon attachment) — only the unnecessary mask step is gone.
Upstream release Dec 19 2025 (v1.2.2.2 was Sep 30 2025). Same 1974
variables in both, but the compound_name schema changed in three ways:
- region suffixes lowercased: GLB/NH/SH -> glb/nh/sh
(552 .GLB + 8 .NH + 8 .SH renames across 17 rule yamls)
- ATA/GRL polar codes lowercased: ata/grl (no project rules affected)
- 13 real compound_name schema renames touching this project:
- atmosChem.{cfc11,cfc12,ch4,n2o}.tavg-u-hm-u -> .tavg-u-hm-air
- land.tas.tavg-u-hxy-u.1hr -> .tavg-h2m-hxy-u.1hr (glb + 30S-90S)
- land.tslsi.tavg-u-hxy-u.day -> .tavg-u-hxy-lsi.day
- seaIce.sisnmass.tavg-u-hm-u -> .tavg-u-hm-si (day/mon x nh/sh)
- atmos.ts.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day
- land.esn.tavg-u-hxy-sn.day -> .tavg-u-hxy-lnd.day
Source default in std_lib/global_attributes.py:136 updated from "GLB"
to "glb" to match the new lowercase convention when get_region() falls
back to the default (rule paths already supply region explicitly via
compound_name).
Validation: all 17 rule yamls construct CMORizer cleanly against the
new metadata.json; 548/552 compound_names resolve in v1.2.2.3 (the 4
unmatched are Laszlo's per-PFT yearly diagnostic rules, extra-CMIP).
Metadata cache populated at:
~/.cache/pycmor/cmip7_metadata/v1.2.2.3/metadata.json (kept v1.2.2.2)
Substantive content edits in v1.2.2.3 (for variables we use):
- atmos.{hurs,rsds,sfcWind}.tavg-...-1hr.30S-90S: cleaner CF-standard
long_name values (long_name was internal text in v1.2.2.2).
cli39 veg_land shards 2/3/4 failed with
ValueError: Rule 'treeFracBdlEvg_yr' with
compound_name='land.treeFracBdlEvg.tavg-u-hxy-u.yr.GLB' did not match any
variables in the CMIP7 data request
CMIP7 only has the per-PFT treeFrac variants at MONTHLY cadence:
Emon.treeFracBdlDcd / Emon.treeFracBdlEvg / Emon.treeFracNdlDcd / Emon.treeFracNdlEvg
No yearly Eyr.treeFrac{BdlDcd,BdlEvg,NdlDcd,NdlEvg} entries exist. The monthly
counterparts are already wired in cap7_land/cmip7_awiesm3-veg-hr_cap7_land.yaml
and cover the CMIP7 requirement.
Removes:
- treeFracBdlDcd_yr
- treeFracBdlEvg_yr
- treeFracNdlDcd_yr
- treeFracNdlEvg_yr
Keeps the aggregate treeFrac_yr (maps to valid Eyr.treeFrac).
cli39 extra_land (job 25003750_1) hit 3h walltime with 4 save_dataset ops stuck at heartbeat #16: ⟳ save_dataset[tas] still running (t=960s, #16) ⟳ save_dataset[mrsow] still running (t=960s, #16) ⟳ save_dataset[dslw] still running (t=960s, #16) ⟳ save_dataset[orog] still running (t=960s, #16) All four are long-running surface saves (1-hourly tas SH-subset, daily mrsow/dslw via swvl1..swvl4 + temporal_diff, fx orog SH-subset). With extra_land's 18 rules all in a single shard at SHARD_SIZE=20, these four kicked off parallel saves that contended on the synchronous netcdf write scheduler. 9 saves completed cleanly, then the 4 wedged. Lower SHARD_SIZE specifically to 5 for extra_land — splits 18 rules into ~4 shards, distributing the contended saves across them. Each shard's individual save peak stays small enough to finish inside 3h. Other tiers keep the global SHARD_SIZE=20.
Both tiers hit TIMEOUT in cli40 due to multiple save_dataset operations contending on the global HDF5 write lock with no progress at heartbeat #87+ over 3h. Sequence in lrcs_seaice_3: siarea 176×, siextent 89×, sisnmass 87×, sfdsi 87×, sihc 86×, sidmasstrany 86× veg_land_1 wedged on vegHeight/dgw/tslsi/esn at heartbeat #21. Wall-vs-Σsave analysis (cli40 successful shards) shows these two tiers already run nearly serially under lock contention: lrcs_seaice_4: ratio 1.18× (12% wall-time penalty if forced serial) lrcs_seaice_2: 1.8× Forcing strict serial eliminates the wedge with negligible cost. 3D-heavy tiers (cap7_atm 4.1×, core_ocean 3.7×, extra_atm 1.8×) genuinely need parallel saves to stay inside 3h walltime, so they are NOT modified. Mechanism: every pipeline in the two yamls now declares throttle_group: <tier>_serial and the pycmor: section adds throttle_caps: <tier>_serial: 1 which caps in-flight rules of that group at 1 per batch. For lrcs_seaice this also subsumes the previous oifs_regrid cap=2 (which was set for driver-RSS, not the lock); cap=1 trivially satisfies it.
cli41 logs showed "Throttle caps (per-group rule submission limit): none" in both lrcs_seaice and veg_land — the yaml-level `throttle_caps` block I added in 5a61ad3 got silently dropped by the Everett-based PycmorConfigManager (only Options declared in PycmorConfig are preserved through schema normalisation; unknown keys vanish). Result: default per-group cap of 2 applied, lrcs_seaice still wedged on 9+ parallel save_dataset operations: cli41 _1: TIMEOUT (heartbeat #16, 9 saves stuck) cli41 _3: TIMEOUT (same pattern) The cmorizer `_resolve_throttle_caps()` already reads PYCMOR_THROTTLE_CAPS from the environment as a higher-priority source than the yaml. Move the per-tier cap declaration to the launcher. submit_hr_year_shards.sh: - New per-tier `tier_throttle_caps` case: lrcs_seaice → "lrcs_seaice_serial:1" veg_land → "veg_land_serial:1" (others) → "" - Threaded into `--export=...,PYCMOR_THROTTLE_CAPS=$tier_throttle_caps` Yaml cleanup: - Removed the dead `throttle_caps:` block from both yamls - Kept the explanatory comment so the next reader knows where the cap actually lives (the launcher) Pipeline-level `throttle_group:` annotations from 5a61ad3 are unchanged and are still read by Everett (declared on the Pipeline schema).
cli43 lrcs_seaice_3 TIMEOUT despite cap=1 being correctly applied:
Throttle caps: {'lrcs_seaice_serial': 1}
... but 5 saves concurrent and stuck at heartbeat #101+
Root cause: `_rule_throttle_group()` only walked `rule.pipelines[*]`.
20 of 64 lrcs_seaice rules (siarea_north, siextent_north_day,
siflcondbot, sidconcth, sitimefrac, ...) have no `pipelines:` key —
they use pycmor's default pipeline, which carries no throttle_group,
so they landed in `_unthrottled` and ran in parallel through the
HDF5 global write lock.
Two-part fix:
1. cmorizer.py: read rule.throttle_group first, fall back to pipeline
annotation. Rule-level wins (per-rule override), preserves existing
pipeline-level behaviour.
2. lrcs_seaice + veg_land yamls: add `throttle_group: <tier>_serial`
to the `inherit:` block. With validate.py:299 already accepting
`throttle_group` on the rule schema, inherit propagates it to every
rule — pipelined or not.
Effect on cli44+: all 64 lrcs_seaice rules (and all veg_land rules)
join their tier's serial group; PYCMOR_THROTTLE_CAPS env var caps it
at 1; the multi-rule HDF5-lock wedge can no longer form.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
CMIP7 cmorization for AWI-ESM3-VEG-HR
Adds full CMIP7 support targeting AWI-ESM3-VEG-HR, including a native compound-name
architecture that replaces the legacy cmip6-table-based data request lookup.
Key changes
CMIP7 data request
DataRequestfromCMIP7_DReq_metadataJSON instead of cmip6 tablesocean.tos.tavg-u-hxy-sea.mon.GLB)cmip6_table→cmip6_cmor_tablein vendored metadatacompound_namematching againstcmip6_compound_nameandcmip7_compound_nameattributestable_idfrom compound name when not set explicitlyValueErroron zero DRV matches (instead of silent skip)Pipeline
vertical_integratecustom pipeline stepconvert()step fromDefaultPipelineStateobjects not being unwrapped to actual results in parallel runsStandard library
src/pycmor/std_lib/time_bounds.py)getattr+_pycmor_cfgfallbackglobal_attributesto derivetable_idfrom CMIP6/CMIP7 compound namesXarray accessor API
StdLibAccessorwith.process()Test infrastructure
pycmor.fixtures.model_runs)pycmor.tutorialdataset system (xarray.tutorial-style API)Misc fixes
entry_points()compatibilitypyfesom2imports for environments without itTest plan
pytest tests/unit/pycmor process examples/awiesm3-cmip7-minimal.yamlruns successfully on Levante