Add variance-stratified residual-Mahalanobis FDR flavor#8
Closed
katosh wants to merge 12 commits into
Closed
Conversation
- Add kompot smooth section with usage, options, and examples - Add smooth to overview, quick start workflow, config templates, help - Fix all \\ to single \ in code-block directives (RST renders literally)
The per-command basic examples repeated the same info as the workflow block above and the detailed command sections below.
…abels - Exclude compute_differential_abundance/expression from automodule - Add smooth_expression automodule (exclude deprecated compute_smoothed_expression) - Add RunInfo.to_settings() and call_args() to documented members - Rename "Gene Expression Imputation" to "Smoothing" in toctree
- Add --dry-run flag to kompot de: estimates resource requirements without running the analysis - JSON output to stdout (or -o file), human-readable report to stderr - Exit code 0 if feasible, 1 if not - Add ResourcePlan.to_dict() for machine-parseable serialization - Add configure_logging() to redirect kompot logger stream - CLI now logs to stderr by default (keeps stdout clean for machine-parseable output like --dry-run and --table-output) - Document --dry-run in CLI docs with pipeline examples
Users would likely reuse the same args as the real run, which would overwrite their h5ad output with a JSON file. Stdout-only is safer.
- Add dry_run=False to all 8 DE test Namespace objects (run_de now accesses args.dry_run from the --dry-run flag added in this PR) - Update actions/checkout v4→v6 and actions/setup-python v5→v6 to resolve Node.js 20 deprecation warnings
- Add TestCLIDryRun: covers --dry-run JSON output, infeasible exit code 1, and output validation skip when dry_run=True - Add TestConfigureLogging: covers stream redirection and default - Fix codecov action: file→files (v5 API change) - Add --cov-report=term-missing to pytest so coverage stats appear in CI logs
When GPSettings.use_empirical_variance=False (the default) the
Mahalanobis statistic D^2 scales monotonically with per-gene expression
variance; the gene-shuffled null inherits that scaling, so the 1-D
local FDR is genome-wide calibrated but not gene-specific. In designs
with ~2 biological replicates per condition this causes the top DE
list to be dominated by high-variance/variability genes rather than
condition-specific signal.
Add an opt-in FDRSettings.mode='variance_stratified' that:
* fits a smooth null-trend surface phi_hat(log_mean, log_var) on
log1p(null_mahalanobis) via OLS on a 7-term tensor polynomial
* residualises log1p(D^2) and standardises by the homoscedastic
null residual scale
* applies kompot's existing 1-D local FDR to max(0, Z)
Outputs are written to adata.var alongside the untouched raw columns:
*_residual_mahalanobis, *_residual_z, *_residual_local_fdr,
*_residual_is_de (and pvalue/tail_fdr/log_mean/log_var under
store_additional_stats=True).
Public helpers live in kompot.residualization for post-hoc use.
Tal1 chimera validation: top-20 flips from 12 imprinted / 5
hematopoietic (raw) to 0 imprinted / 11 hematopoietic (residual),
including the predicted Itga2b, Lyl1, Stab1/2, Tfec, Ackr1.
Default behaviour is unchanged; mode='raw' remains the default.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Opt-in FDR calibration mode
FDRSettings(mode="variance_stratified")that residualiseslog(1 + D²)against a smooth surface in(log_mean, log_var)fit on the permutation null, then applies kompot's existing 1-D local FDR to the standardised residualZ. Intended for low-replication designs (≈2 biological replicates per condition), where the raw Mahalanobis is dominated by per-gene variance rather than condition-specific signal.mode="raw").*_mahalanobis,*_mahalanobis_local_fdr, and*_is_decolumns are preserved.*_residual_mahalanobis,*_residual_z,*_residual_local_fdr,*_residual_is_de(plus
_pvalue/_tail_fdr/_log_mean/_log_varunderstore_additional_stats=True).kompot.residualizationfor post-hoc use on any existing kompot output.Background
When
GPSettings.use_empirical_variance=False(kompot's default — chosen so per-cell sampling noise is not conflated with epistemic posterior uncertainty),D²_gscales monotonically with per-gene expression varianceσ²_g. The gene-shuffled null inherits that scaling, so the 1-D local FDR is genome-wide calibrated but not gene-specific. With ~2 replicates per condition, any 2-vs-2 partition — including the real one — produces mean differences roughly proportional toσ²_g, so high-variance genes rank at the top under any partition.Method
m_g = log1p(mean_g),v_g = log1p(var_g)from the DE expression layer.φ̂(m, v) = E[log(1 + D²_null) | m, v]via OLS on a 7-term tensor polynomial over the gene-shuffled null draws.σ̂_null.R_g = log(1 + D²_g) − φ̂(m_g, v_g),Z_g = R_g / σ̂_null.max(0, Z_g). Below-null genes getlocal_fdr ≈ 1.Tal1 chimera validation
Post-hoc on
/export/chimera_tal1_kompot_ps01_*:σ̂_nullon Tal1 nullResidual top-20 (in order): Hbb-bh1, Hba-x, Hbb-y, Stab2, Itga2b, Stab1, Tfec, Aqp8, Rplp0, Rpl41, Hba-a1, Slc22a18, Ackr1, Ushbp1, Erdr1, Rps18, Adora2a, Clec14a, Lyl1, Hba-a2. Matches the known Tal1 hematopoietic phenotype; the raw list was dominated by high-variance imprinted / X-linked genes.
Design doc
Full derivation, evidence table, and limitations section (also destined for the methods supplement) live in the design note
reports/kompot_notebooks_private_2026-04-17_142500_residual-mahalanobis-flavor.mdof the companionkompot_notebooks_privateworkspace.What this does and does not do
Does
(mean, variance)from the statistic before FDR calibration.(m, v).Does not
Settings
Test plan
tests/test_residualization.py(feature extraction, null-trend fit, residualisation math, end-to-end throughkompot.de, settings validation) — all pass.test_fdr_utils,test_fdr_edge_cases,test_fdr_integration,test_model_settings): 75 / 75 pass.test_differential,test_differential_expression_core,test_differential_expression_comprehensive,test_de_comprehensive_params,test_field_tracking,test_anndata_tracking): 74 pass, 4 pre-existing skips.Files
kompot/residualization.py— pure-numpy module withcompute_gene_features,fit_null_trend,residualize_mahalanobis,residual_local_fdrkompot/settings.py—FDRSettingsknobs:mode,null_trend_features,null_trend_modelkompot/anndata/_de_helpers.py—_compute_residual_fdrhelper; extended_compute_fdr,_store_de_results,_build_field_mapping,_check_overwriteskompot/anndata/differential_expression.py— plumbs settings throughde()and legacycompute_differential_expression()kompot/anndata/utils/field_tracking.py— residual field-name keysdocs/source/variance_stratified_fdr.rst— user-facing guidedocs/source/index.rst,README.md,CHANGELOG.mdtests/test_residualization.py