Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,22 @@ All notable changes to this project will be documented in this file.

## [Unreleased]

## [0.8.0] - 2026-05-28

### Behavior changes

These three changes correct discrepancies between the implementation and the manuscript that describes Kompot's statistics. Two are numerical scale shifts that preserve relative rankings; the third is a default-value harmonization. Re-tune any absolute thresholds calibrated against 0.7.0.

- **Mahalanobis denominator now sums covariances**: the gene-wise Mahalanobis distance used by `DifferentialExpression.predict(compute_mahalanobis=True)` now computes the posterior combined covariance as `Σ_a + Σ_b` (the variance of the difference of two independent posterior estimators) instead of `(Σ_a + Σ_b) / 2`. Matches the manuscript definition `D(a,b) = sqrt((μ_a − μ_b)^T (Σ_a + Σ_b)^(-1) (μ_a − μ_b))`. **Effect**: absolute Mahalanobis distances in the GP-only regime contract by a factor of `√2` (and `D²` by 2). Relative rankings of genes are unchanged because the same scale factor applies everywhere, and FDR thresholds re-calibrate against the null. The sample-variance branch was already correctly summed.
- **Differential-abundance posterior tail probability is now one-sided**: `DifferentialAbundance.predict()` returns `PTP = Φ(−|z|)` (one-sided), matching the manuscript definition `PTP(x_i) = Φ(−|Δ(x_i)|/√(σ_a² + σ_b²))`. Previous releases returned `2·Φ(−|z|)` (two-sided). **Effect**: numeric PTP values are halved relative to 0.7.0; equivalently, `neg_log10_fold_change_ptp` increases by `log10(2) ≈ 0.301`. The threshold `PTP < 1e-3` previously corresponded to `|z| ≥ 3.29` and now corresponds to `|z| ≥ 3.09`. Re-tune any hard-coded `ptp_threshold` chosen against 0.7.0 if you want to preserve the old call-rate.
- **`use_empirical_variance` default is now `False` everywhere**: harmonized across `kompot.de()` (already False), `DifferentialExpression.__init__`, `ExpressionModel.__init__`, `kompot.smooth_expression()`, the deprecated `compute_differential_expression()` and `compute_smoothed_expression()` wrappers, and the CLI `smooth_config_template.yaml`. Previously these four entry points defaulted to `True`, inconsistent with both the recommended `kompot.de()` path and the manuscript's "empirical variance is disabled by default" statement. Code that relies on empirical variance must now pass `use_empirical_variance=True` explicitly.
- **Differential-expression posterior tail probability is now stored in log space**: `DifferentialExpression.predict(compute_mahalanobis=True)` and `kompot.de(..., store_additional_stats=True)` now compute the Mahalanobis posterior tail probability with `scipy.stats.chi2.logsf` in float64 and store `-log10(PTP)` in a renamed field `<result_key>_<a>_to_<b>_neg_log10_ptp` (previously the linear tail probability `chi2.sf` in `<result_key>_<a>_to_<b>_ptp`). The PTP is a strictly monotone transform of the Mahalanobis distance, but for an embedding with df on the order of tens the linear `chi2.sf` evaluates to values numerically indistinguishable from `1.0` for the majority of genes (every gene with `D²` below the chi-squared mean), saturating the stored statistic and destroying gene-ranking resolution at the head of the distribution. Computing `logsf` directly (never forming `1 − cdf`) keeps every value distinct and recovers the Mahalanobis ranking exactly. This mirrors the differential-abundance path, whose `neg_log10_lfc_ptp` field is already stored this way. **Effect**: the stored values change scale and orientation — larger now means more significant (a probability `p` becomes `−log10(p) ∈ [0, ∞)`), and the field is renamed. `volcano_de(y_axis_type="ptp")` reads the column directly with no additional `-log10` transform, and its `significance_threshold` is still supplied as a probability. Update any code that reads the old `_ptp` column.

### New features

- **`--dry-run` flag for `kompot de` CLI**: estimates memory, disk, and output field requirements without running the analysis. Outputs machine-parseable JSON to stdout and a human-readable report to stderr. Exit code reflects feasibility.
- **`kompot.configure_logging(stream)`**: reconfigure the kompot logger output stream. The CLI now logs to stderr by default, keeping stdout clean for machine-parseable output (dry-run JSON, table output).
- **`kompot.plot.lollipop`**: ax-embeddable gene-set-enrichment lollipop plot. One row per enriched term; a stem runs to a dot whose x-position encodes significance (`-log10(FDR)` by default, or any score column such as StringDB `signal` / enrichr `Combined Score`) and whose area encodes the matched-gene count, with a dashed `FDR = 0.05` guide and an in-axes aesthetic key. The headline feature is input flexibility: pass a `kompot.plot.StringDBReport` (its `get_functional_enrichment()` is called for you), the `signal`-sorted DataFrame that method returns, **or** a generic enrichment table from another tool (gseapy/enrichr, GOATOOLS, clusterProfiler). Column-name mapping params (`term_col`, `score_col`, `count_col`, `fdr_col`) with case-insensitive autodetection — including the gseapy `"k/K"` `Overlap`-string parser — bridge the schema differences. The fig-3 manuscript specifics (direction-red accent, reserved title band, GO-Process category) are now parameters with manuscript-matching defaults. Like `dotplot`, it composes into an externally-provided `ax=` instead of building its own `GridSpec`.
- **`kompot.plot.dotplot`**: ax-embeddable fold-change-per-group dotplot. Color = mean of a per-cell LFC layer within each `groupby` category; size = fraction of cells expressing. Gene selection is either an explicit list or auto-picked top-N by Mahalanobis from run history (with optional `filter_key`, e.g. restricting to `is_de=True`). Pass `axes=(main, cbar, size_legend)` to compose into a larger figure, or leave `axes=None` for a standalone figure. Unlike `scanpy.pl.DotPlot`, this function does not build its own `GridSpec` and does not fight externally-provided axes, which is the whole reason it exists. Shares gene-selection, layer-fetch, and colormap-normalization primitives with `kompot.plot.heatmap` via the existing `heatmap.utils` helpers.

### Improvements
Expand Down
10 changes: 10 additions & 0 deletions docs/source/plotting.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,16 @@ Expression Plots

.. autofunction:: kompot.plot.plot_gene_expression

Dotplots
--------

.. autofunction:: kompot.plot.dotplot

Enrichment Lollipop
-------------------

.. autofunction:: kompot.plot.lollipop

Heatmaps
--------

Expand Down
44 changes: 30 additions & 14 deletions kompot/anndata/_de_helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -771,8 +771,10 @@ def _compute_fdr(
"mahalanobis": internal_null_mahalanobis,
"mean_lfc": expression_results["mean_log_fold_change"][n_real:],
}
if "ptp" in expression_results:
null_table_data["ptp"] = expression_results["ptp"][n_real:]
if "neg_log10_ptp" in expression_results:
null_table_data["neg_log10_ptp"] = expression_results["neg_log10_ptp"][
n_real:
]
null_data["table"] = pd.DataFrame(
null_table_data,
index=null_gene_names,
Expand Down Expand Up @@ -823,8 +825,10 @@ def _compute_fdr(

if "mahalanobis_distances" in expression_results:
expression_results["mahalanobis_distances"] = real_mahalanobis
if "ptp" in expression_results:
expression_results["ptp"] = expression_results["ptp"][:n_real]
if "neg_log10_ptp" in expression_results:
expression_results["neg_log10_ptp"] = expression_results["neg_log10_ptp"][
:n_real
]

return fdr_results

Expand Down Expand Up @@ -909,17 +913,21 @@ def _store_de_results(
adata,
)

if compute_mahalanobis and "ptp" in expression_results and store_additional_stats:
ptp = _ensure_1d(
expression_results["ptp"],
"ptp",
if (
compute_mahalanobis
and "neg_log10_ptp" in expression_results
and store_additional_stats
):
neg_log10_ptp = _ensure_1d(
expression_results["neg_log10_ptp"],
"neg_log10_ptp",
n_selected,
logger,
)
_add_var_column(
new_var_columns,
field_names["ptp_key"],
ptp,
neg_log10_ptp,
selected_genes,
adata,
)
Expand Down Expand Up @@ -1426,9 +1434,13 @@ def _compute_group_results(
f"significantly DE at FDR < {fdr_threshold}"
)

# Group-wise ptp
if compute_mahalanobis and store_additional_stats and "ptp" in subset_results:
sub_ptp = subset_results["ptp"]
# Group-wise neg_log10_ptp
if (
compute_mahalanobis
and store_additional_stats
and "neg_log10_ptp" in subset_results
):
sub_ptp = subset_results["neg_log10_ptp"]
if len(sub_ptp) == len(expanded_genes):
sub_ptp = sub_ptp[:n_real]
elif len(sub_ptp) != n_real:
Expand Down Expand Up @@ -1641,7 +1653,8 @@ def _build_field_mapping(
"location": "var",
"type": "ptp",
"description": (
"Posterior tail probability from chi-squared distribution"
"Negative log10 posterior tail probability (-log10 PTP) from "
"the chi-squared distribution, computed in log space"
),
}

Expand Down Expand Up @@ -1728,7 +1741,10 @@ def _add_group_field_mapping(
field_mapping[ptp_k] = {
"location": "varm",
"type": "ptp",
"description": "Peak-to-peak values for all subsets",
"description": (
"Negative log10 posterior tail probability (-log10 PTP) for "
"all subsets"
),
"contains_subsets": subset_names,
}

Expand Down
2 changes: 1 addition & 1 deletion kompot/anndata/cleanup.py
Original file line number Diff line number Diff line change
Expand Up @@ -89,7 +89,7 @@ def cleanup(

- ``'mean_log_fold_change'``: Mean log fold change values
- ``'mahalanobis'``: Mahalanobis distances
- ``'ptp'``: Posterior tail probability
- ``'ptp'``: Negative log10 posterior tail probability (-log10 PTP)
- ``'mahalanobis_pvalue'``: P-values from empirical null
- ``'mahalanobis_local_fdr'``: Local FDR values
- ``'mahalanobis_tail_fdr'``: Tail-based FDR values
Expand Down
6 changes: 3 additions & 3 deletions kompot/anndata/differential_expression.py
Original file line number Diff line number Diff line change
Expand Up @@ -530,8 +530,8 @@ def de(

if compute_mahalanobis and "mahalanobis_distances" in expression_results:
results_data["mahalanobis"] = expression_results["mahalanobis_distances"]
if "ptp" in expression_results:
results_data["ptp"] = expression_results["ptp"]
if "neg_log10_ptp" in expression_results:
results_data["neg_log10_ptp"] = expression_results["neg_log10_ptp"]

result_dict["table"] = pd.DataFrame(results_data, index=selected_genes)
result_dict["underrepresentation"] = underrep
Expand Down Expand Up @@ -751,7 +751,7 @@ def compute_differential_expression(
return_full_results: bool = False,
store_posterior_covariance: bool = False,
allow_single_condition_variance: bool = False,
use_empirical_variance: bool = True,
use_empirical_variance: bool = False,
progress: bool = True,
null_genes="auto",
null_seed=42,
Expand Down
4 changes: 2 additions & 2 deletions kompot/anndata/smooth.py
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ def smooth_expression(
ls = gp.ls if gp is not None else None
ls_factor = gp.ls_factor if gp is not None else 10.0
n_landmarks = gp.n_landmarks if gp is not None else 5000
use_empirical_variance = gp.use_empirical_variance if gp is not None else True
use_empirical_variance = gp.use_empirical_variance if gp is not None else False
eps = gp.eps if gp is not None else 1e-8
random_state = gp.random_state if gp is not None else None
batch_size = gp.batch_size if gp is not None else 500
Expand Down Expand Up @@ -393,7 +393,7 @@ def compute_smoothed_expression(
sigma: float = 1.0,
ls: Optional[float] = None,
ls_factor: float = 10.0,
use_empirical_variance: bool = True,
use_empirical_variance: bool = False,
eps: float = 1e-8,
random_state: Optional[int] = None,
batch_size: int = 500,
Expand Down
2 changes: 1 addition & 1 deletion kompot/anndata/utils/field_tracking.py
Original file line number Diff line number Diff line change
Expand Up @@ -226,7 +226,7 @@ def generate_output_field_names(
field_names.update(
{
"mahalanobis_key": f"{result_key}_{cond1_safe}_to_{cond2_safe}_mahalanobis{suffix}",
"ptp_key": f"{result_key}_{cond1_safe}_to_{cond2_safe}_ptp{suffix}",
"ptp_key": f"{result_key}_{cond1_safe}_to_{cond2_safe}_neg_log10_ptp{suffix}",
"mean_lfc_key": f"{result_key}_{cond1_safe}_to_{cond2_safe}_mean_lfc",
"smoothed_key_1": f"{result_key}_{cond1_safe}_smoothed",
"smoothed_key_2": f"{result_key}_{cond2_safe}_smoothed",
Expand Down
2 changes: 1 addition & 1 deletion kompot/cli/templates/smooth_config_template.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ n_landmarks: 5000 # Number of landmarks for Nystrom approximation
sample_col: null # Column in adata.obs with sample IDs

# Empirical variance (heteroscedastic noise):
use_empirical_variance: true # Estimate per-gene noise from GP residuals
use_empirical_variance: false # Estimate per-gene noise from GP residuals

# GP kernel parameters:
sigma: 1.0 # Noise level for function estimator
Expand Down
5 changes: 3 additions & 2 deletions kompot/differential/differential_abundance.py
Original file line number Diff line number Diff line change
Expand Up @@ -603,11 +603,12 @@ def compute_sample_variance2(X_batch):
sd = np.sqrt(log_fold_change_uncertainty + self.eps)
log_fold_change_zscore = log_fold_change / sd

# Compute PTP (Posterior Tail Probability) in natural log (base e)
# Compute PTP (Posterior Tail Probability) in natural log (base e).
# One-sided per manuscript: PTP = Φ(−|z|) = min(Φ(z), Φ(−z)) for real z.
ln_ptp = np.minimum(
normal.logcdf(log_fold_change_zscore),
normal.logcdf(-log_fold_change_zscore),
) + np.log(2)
)

# Convert from natural log to negative log10 (for better volcano plot visualization)
# ln_ptp is a log of a small value (typically < 1), so it's negative
Expand Down
44 changes: 35 additions & 9 deletions kompot/differential/differential_expression.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,7 @@

import numpy as np
import jax
import jax.numpy as jnp
import jax.scipy.stats as jax_stats
from scipy.stats import chi2 as scipy_chi2
from typing import Optional, Dict, Any
import logging
from mellon.parameters import compute_landmarks
Expand Down Expand Up @@ -42,7 +41,7 @@ def __init__(
self,
n_landmarks: Optional[int] = None,
use_sample_variance: Optional[bool] = None,
use_empirical_variance: bool = True,
use_empirical_variance: bool = False,
eps: float = 1e-8, # Increased default epsilon for better numerical stability
jit_compile: bool = False,
function_predictor1: Optional[Any] = None,
Expand Down Expand Up @@ -625,8 +624,10 @@ def compute_mahalanobis_distances(
# Points for sample variance computation
variance_points = X

# Average the covariance matrices
combined_cov = (cov1 + cov2) / 2
# Sum the covariance matrices: Σ_a + Σ_b is the variance of the
# difference of independent posterior estimators, matching the
# Mahalanobis denominator defined in the manuscript.
combined_cov = cov1 + cov2
del cov1, cov2

# For sample variance, use diag=False to get full covariance matrices
Expand Down Expand Up @@ -976,13 +977,38 @@ def predict(

if hasattr(self, "_last_mahalanobis_dof"):
logger.debug(
f"Computing ptp with {self._last_mahalanobis_dof} degrees of freedom..."
f"Computing neg_log10_ptp with {self._last_mahalanobis_dof} "
"degrees of freedom..."
)
mahalanobis_squared = jnp.array(mahalanobis_distances) ** 2
ptp = jax_stats.chi2.sf(
# Posterior tail probability (PTP) of the Mahalanobis distance under
# the chi-squared null. Computed in LOG space and stored as
# -log10(PTP), mirroring the DA path's neg_log10_lfc_ptp convention.
#
# The PTP is chi2.sf(D^2, df). For an embedding with df on the order
# of tens, the vast majority of genes have D^2 well below the chi2
# mean, where the linear sf evaluates to values numerically
# indistinguishable from 1.0 in float64 (1 - epsilon rounds to 1.0).
# Storing the linear sf therefore collapses most genes onto a single
# saturated value and destroys gene-ranking resolution at the head of
# the distribution. chi2.logsf returns the log of the same quantity
# directly (never forming 1 - cdf), so log(1 - epsilon) ~= -epsilon
# remains representable and every gene keeps a distinct value.
#
# scipy in float64 is used deliberately: jax runs in float32 unless
# x64 is explicitly enabled, and float32 logsf re-collapses the
# dynamic range (the precision is in the mantissa we are trying to
# preserve).
mahalanobis_squared = np.asarray(
mahalanobis_distances, dtype=np.float64
) ** 2
ln_ptp = scipy_chi2.logsf(
mahalanobis_squared, df=self._last_mahalanobis_dof
)
result["ptp"] = np.array(ptp)
# Convert natural-log tail probability to -log10(PTP): positive and
# larger for more significant genes, matching the DA convention.
result["neg_log10_ptp"] = np.asarray(
-(ln_ptp / np.log(10)), dtype=np.float64
)

return result

Expand Down
3 changes: 2 additions & 1 deletion kompot/differential/expression_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,7 @@ class ExpressionModel:
Number of landmarks for Nystrom approximation.
use_empirical_variance : bool
Whether to estimate per-gene empirical variance from GP residuals.
By default False.
eps : float
Small constant for numerical stability.
random_state : int, optional
Expand All @@ -135,7 +136,7 @@ class ExpressionModel:
def __init__(
self,
n_landmarks: Optional[int] = None,
use_empirical_variance: bool = True,
use_empirical_variance: bool = False,
eps: float = 1e-8,
random_state: Optional[int] = None,
batch_size: int = 500,
Expand Down
14 changes: 14 additions & 0 deletions kompot/plot/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -166,6 +166,20 @@ def dotplot(*args, **kwargs):
)


try:
from .lollipop import lollipop

__all__.append("lollipop")
except ImportError as e:
logger.warning(f"Could not import lollipop function due to: {e}")

def lollipop(*args, **kwargs):
raise ImportError(
"Lollipop plot unavailable due to missing dependencies. "
"matplotlib is required."
)


# Import StringDB report class
try:
from .stringdb import StringDBReport
Expand Down
2 changes: 1 addition & 1 deletion kompot/plot/field_inference.py
Original file line number Diff line number Diff line change
Expand Up @@ -230,7 +230,7 @@ def _fallback_field_inference(
"direction_key": ["direction"],
"mean_lfc_key": ["mean_lfc", "lfc", "log_fold_change", "fold_change"],
"mahalanobis_key": ["mahalanobis", "score"],
"ptp_key": ["ptp"],
"ptp_key": ["neg_log10_ptp", "ptp"],
"is_de_key": ["is_de", "significant"],
"zscore_key": ["zscore", "z_score"],
"density_key_1": ["log_density"],
Expand Down
Loading
Loading