PEPATAC 0.14.0 #328

Open
jpsmith5 wants to merge 51 commits into master from dev

@jpsmith5

Contributor

Release 0.14.0. See docs/changelog.md for full Added/Changed/Fixed/Removed entries.

nsheff and others added 30 commits April 11, 2026 22:12
Bumps [pytest](https://github.com/pytest-dev/pytest) from 3.1.3 to 9.0.3.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@3.1.3...9.0.3)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
- Drop version+build pins on ucsc-bedgraphtobigwig, ucsc-bedtobigbed,
  ucsc-bigwigmerge, ucsc-stringify in requirements-conda.yml — the pinned
  377 builds required openssl 1.1.1 and conflicted with the env's
  openssl 3.x. Letting the solver pick a compatible build resolves the
  install. (#321)
- Add r-argparser to requirements-conda.yml; tools/PEPATAC_summarizer.R
  needs it but it was not in the conda env. (#228)
- Add r-r.utils to requirements-conda.yml and R.utils to PEPATACr
  Imports; required by data.table::fread for .bed.gz peak coverage
  files. (#229)
- Pin GenomicDistributions (>= 1.4.6) and GenomicDistributionsData
  (>= 1.0.0) in PEPATACr Imports to avoid the chromSizes_hg38 /
  TSS_hg38 namespace mismatch when one side is upgraded without the
  other. (#230)
- Fix curl-into-variable bug in checkinstall at three sites: the
  REQS / PIPELINE fallback branches stored file contents in a variable
  that was used as a path downstream. Switched to mktemp + curl -o for
  the URL fallback, assigned the path directly for the local branch.
  (#226)
- Drop dead run-container.md link from docs/install.md and refresh
  intro now that containers were removed in 0.12.0.

Closes #228, #229, #230, #321, #226.
…-dedup

- _align() referenced an undefined  in the single-end + bwa branch
  paired chain (minus filter_pair) for both --keep and no-keep paths:
  pm.run([cmd1, cmd2, cmd3, cmd4], <target>). (#299)
- Fix missing pm.fail_pipeline in the unmap_fq1 branch of the
  filter_paired_fq.pl handle check; previously a stuck filter on R1
  set an error string that was never raised. Reworked the error
  message into a shared template that points at the underlying psutil
  introspection issue and recommends both --keep and --noFIFO as
  workarounds. (#234)
- Add --skip-dedup flag for protocols where duplicates are
  biologically meaningful (CUT&Tag, CUT&RUN). When set: copy
  mapping_genome_bam to _sort_dedup.bam so downstream peak calling
  finds the expected path; report Duplicate_reads=0 and pass through
  Dedup_aligned_reads/Dedup_alignment_rate/Dedup_total_efficiency
  from the pre-dedup metrics. Plumbed through
  sample_pipeline_interface.yaml so it can be set per-sample. (#249)
- Drop redundant Time/Success keys from pepatac_output_schema.yaml
  (both samples: and project: blocks). These are pipestat's auto-
  tracked status fields and the duplicate declaration triggered
  "SchemaError: Overlap between project- and sample-level keys" on
  newer pipestat. (#322, #305)
- Fix _LOGGER NameError in tools/bamQC.py and bamSitesToWig.py: the
  variable was only defined inside , so
  pararead workers re-importing the module under multiprocessing
  'spawn' (macOS default) hit NameError when class methods logged.
  Added a module-level fallback logger above each class definition.
  (#266)
- Fix peakCounts() ref-peaks ignored when *_peaks_coverage.bed.gz
  coexists with *_ref_peaks_coverage.bed: the shared  variable
  preferred .bed.gz from the regular peaks file and then looked for
  a non-existent _ref_peaks_coverage.bed.gz, falling through to the
  "not derived from a singular reference peak set" warning. Detect
  ref vs regular extensions independently. (#218, #219)
- Guard refgenie[sample.genome] lookups in
  sample_pipeline_interface.yaml with ,
  so projects with non-refgenie genomes (e.g. galGal6, bosTau9) no
  longer crash the Jinja template with an attmap AttributeError;
  instead they fall through to the per-sample paths or error
  cleanly from pepatac.py. (#231)
- Fix plotAnno() empty-input fallback path bug: was constructing
  file.path(<output_file>, "<sample>_partition_dist.pdf") (treating
  the output pdf as a directory) and quit()ing the R session.
  Replaced with return(ggplot()), matching the function's other
  empty-data branches; the caller writes a clean blank placeholder
  at the expected target. (#232)

Closes #299, #234, #249, #322, #305, #266, #218, #219, #231, #232.
…ps3dp/tools/refgenie_config.yaml workaround

- faq.md: expand the TSSE entry to name the refgene_anno asset / UCSC
  RefGene as the source of TSS coords and note that the cutoff-of-6
  threshold is hg38-tuned and empirical. Point at ENCODE ATAC-seq
  data standards for per-assembly reference numbers. (#235)
- assets.md: add a Using a custom adapter file subsection
  documenting the adapters resource override in
  pipelines/pepatac.yaml. (#252)
- assets.md: document the /home/jps3dp/tools/refgenie_config.yaml-required-even-with-manual-paths
  quirk and the empty-refgenie-config workaround. The proper fix is
  in the in-progress refgenie 1.0 migration (PR #327). (#251)
- count_table.md: make the per-sample PEPATAC_completed.flag handling
  explicit in the consensus-peak-set count table workflow. Two paths:
  delete the flag files (one-liner with find -delete) or pass
  --ignore-flags to looper run. (#215)
- assets.md: troubleshooting subsection for TypeError: 'NoneType'
  object is not iterable — root-caused to incomplete refgenie assets
  (commonly missing prealignment FASTA), with diagnostic and fix
  commands. The error itself is upstream refgenconf behavior;
  replaced by the refgenie 1.0 migration (PR #327). (#216)
- glossary.md: document column formats for _peaks_coverage.bed
  (8 columns) and _ref_peaks_coverage.bed (15 columns; narrowPeak
  coordinates + bedtools coverage stats + normalized count). (#233)
- assets.md: Running a non-refgenie genome through looper
  subsection — sample_modifiers/imply pattern with chrom_sizes,
  genome_index, etc. set per-sample. (#231 docs portion)

Closes #235, #252, #251, #215, #216, #233.
Two distinct breakages in the integration-test runner that both surface
immediately when running ./tests/scripts/test-integration.sh against a
fresh bulker install:

1. Default crate name didn't match what bulker actually caches.

   The scripts defaulted PEPATAC_TEST_BULKER_CRATE to local/bulker_manifest,
   but bulker caches tests/bulker_manifest.yaml as bulker/pepatac:1.1.1 --
   it auto-namespaces the manifest's name: pepatac field under bulker/.
   bulker crate list | grep -q local/bulker_manifest therefore always
   failed, even immediately after a successful
   bulker crate install tests/bulker_manifest.yaml, leaving the runner
   wedged in a "crate not cached" loop. services.sh's usage banner and
   tests/README.md's env-var table both advertised yet a *third* name
   (databio/pepatac), used nowhere in the code -- doc drift.

   Realigned all three to bulker/pepatac to match what bulker caches.

2. PATH extraction via bulker activate --echo was format-fragile and
   broke on bulker 0.0.15 (the Rust rewrite).

   test-integration.sh and services.sh both invoked
       bulker activate --echo <crate> | grep '^export PATH=' | sed ... | cut ...
   to fish the crate's shim directory out of bulker's activation output.
   On bulker 0.0.15, --echo errored ("argument --echo cannot be used
   multiple times" -- likely a clap quirk in this version), and even when
   it worked the parse was brittle to quoting/formatting changes between
   bulker releases.

   Switched to bulker exec <crate> -- <cmd>, which lets bulker manage
   PATH for the duration of one command. The pytest invocation in
   test-integration.sh and the per-tool which check in services.sh
   now both run inside bulker exec, with no PATH scraping at all.

Files:
- tests/scripts/test-integration.sh: default BULKER_CRATE to bulker/pepatac;
  drop the activate-and-extract-PATH block; run pytest via bulker exec.
- tests/scripts/services.sh: default BULKER_CRATE to bulker/pepatac; fix
  usage banner; rewrite check_tools to probe each tool via
  bulker exec ... -- which <tool>.
- tests/README.md: update env-var table to bulker/pepatac.
- tests/integration/{conftest,test_looper_run,test_end_to_end}.py:
  refresh docstring prereq lines.

Regression from f20c354 (clean up integration tests).
Adds testthat coverage for the two PEPATACr bugs fixed in 8bda2e4 so they
can't be silently reintroduced.

- Hoist peakCounts()'s local detect_ext closure to a package-internal
  helper .detectPeakCoverageExt(suffix, results_subdir, sample_names,
  genomes). Same behavior, no functional change; just makes the
  ext-detection logic unit-testable without reconstructing the full
  peakCounts() pipeline (which would need valid peak data, chrom sizes,
  and full sample_table setup).

- tests/testthat/test-peakCounts-ext.R: seven scenarios exercising the
  helper directly, including the exact mixed-state bug from #218/#219
  (*_peaks_coverage.bed.gz from the initial sample run alongside
  *_ref_peaks_coverage.bed from the --frip-ref-peaks re-run), the inverse
  mixed state, multi-genome sample tables, and the no-match / warning
  fall-through path.

- tests/testthat/test-plotAnno.R: three scenarios for the empty-input
  fallback from #232 — missing input file, empty (size-0) input file,
  and the full caller pattern (pdf() / print(plotAnno(...)) / dev.off())
  producing a non-empty placeholder pdf at the expected target path.
  Asserts the old bug's spurious file.path(output_pdf, ...) touch
  target is NOT created, so the fix can't silently regress.
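The independent ref-versus-regular extension detection behind #218/#219 can be illustrated with a short Python sketch. The real helper is the R function .detectPeakCoverageExt; the Python name, the `peak_calling_<genome>` directory layout, and the demo file names below are assumptions made for illustration only.

```python
import os
import tempfile


def detect_peak_coverage_ext(suffix, results_subdir, sample_names, genomes):
    """Return '.bed.gz' or '.bed' for the first existing file carrying
    this suffix, or None when no sample has one.

    Assumed layout (hypothetical):
    <results_subdir>/<sample>/peak_calling_<genome>/<sample><suffix><ext>
    """
    for sample, genome in zip(sample_names, genomes):
        base = os.path.join(results_subdir, sample,
                            f"peak_calling_{genome}", sample + suffix)
        for ext in (".bed.gz", ".bed"):
            if os.path.exists(base + ext):
                return ext
    return None


def demo_mixed_state():
    """Recreate the mixed state: gzipped regular peaks, plain ref peaks."""
    with tempfile.TemporaryDirectory() as root:
        peak_dir = os.path.join(root, "s1", "peak_calling_hg38")
        os.makedirs(peak_dir)
        for name in ("s1_peaks_coverage.bed.gz", "s1_ref_peaks_coverage.bed"):
            open(os.path.join(peak_dir, name), "w").close()
        regular = detect_peak_coverage_ext(
            "_peaks_coverage", root, ["s1"], ["hg38"])
        ref = detect_peak_coverage_ext(
            "_ref_peaks_coverage", root, ["s1"], ["hg38"])
        return regular, ref
```

Because each suffix is probed on its own, the extension chosen for the regular peaks file no longer dictates which ref-peaks file is looked up.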
Two distinct breakages in the integration-test runner that both surface
immediately when running ./tests/scripts/test-integration.sh against a
fresh bulker install:

1. Default crate name didn't match what bulker actually caches.

   PEPATAC_TEST_BULKER_CRATE defaulted to 'local/bulker_manifest', but
   bulker auto-namespaces the manifest's name: pepatac field and caches
   it as 'bulker/pepatac:1.1.1'. The bulker crate list | grep -q check
   therefore always failed, even right after a successful
   bulker crate install tests/bulker_manifest.yaml.

2. PATH extraction via bulker activate --echo was format-fragile and
   broke on bulker 0.0.15. Switched to bulker exec <crate> -- <cmd>,
   which lets bulker manage PATH for the duration of one command, with
   no output parsing at all.

Realigned both scripts to default to 'bulker/pepatac'; rewrote
test-integration.sh's pytest invocation and services.sh's check_tools
to use bulker exec. Refreshed the README env-var table, services.sh
usage banner, and the docstring prereq lines in conftest.py /
test_looper_run.py / test_end_to_end.py for consistency.

Regression from f20c354 (clean up integration tests).
Both test-integration.sh's "install if not cached" gate and
services.sh's check_crate_cached probed for crate availability via:

    bulker crate list 2>/dev/null | grep -q

But bulker crate list prints crate name and tag in *separate*
whitespace-delimited columns, so a literal bulker/pepatac:1.1.1 grep
never matches even when the crate is freshly cached. test-integration.sh
silently fell through to "install from local manifest" every run
(harmless, just slow). services.sh hard-errored with "Bulker crate ... is
not cached" immediately after a successful install printed
"Cached: bulker/pepatac:1.1.1", which was the user-visible wedge.

Replaced both grep checks with ,
which directly tests the operation we actually care about. The hint
text in services.sh's error path now points at  (the manifest file path that bulker can
actually load) rather than echoing back the cache key, which bulker
would 404 on against hub.bulker.io.

Surfaced after 28edeec (Fix integration-test bulker crate default to
include tag) tightened the default to the full  identifier --
that change exposed the previously-masked format mismatch in the
list-grep check.
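The format mismatch described above can be illustrated with a small sketch. The commit itself drops list parsing altogether; this only shows why a literal name:tag search fails against a two-column listing. The function name and the sample listing are hypothetical, and the real bulker output layout may differ.

```python
def crate_in_listing(listing, crate_id):
    """Check for 'name:tag' in a listing that prints name and tag as
    separate whitespace-delimited columns."""
    name, _, tag = crate_id.partition(":")
    for line in listing.splitlines():
        cols = line.split()
        if len(cols) >= 2 and cols[0] == name and cols[1] == tag:
            return True
    return False


# Hypothetical listing text, invented for this example only.
SAMPLE_LISTING = "bulker/pepatac    1.1.1\nbulker/demo    default\n"

# A literal substring search (what grep -q did) misses the crate...
literal_hit = "bulker/pepatac:1.1.1" in SAMPLE_LISTING
# ...while a column-aware comparison finds it.
column_hit = crate_in_listing(SAMPLE_LISTING, "bulker/pepatac:1.1.1")
```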
…lved one

After bulker-side fixes landed (4c56b39, 28edeec, 9123d8c), the
integration runner finally reached pytest -- only to die immediately with
No module named pytest. The python3 inside bulker exec resolved to
the host's miniforge base install rather than the caller's active conda
env (pepatac-env in this case), which is the one that actually has
pytest + pypiper + the rest of pepatac's requirements installed.

Cause: python3 is declared as a host_command in
tests/bulker_manifest.yaml, so bulker forwards it unmodified to the
host. But bulker's PATH ordering inside bulker exec walks the system
PATH in whatever order and picks the first python3 it finds -- which
on a typical HPC node is miniforge base, not the user's currently
activated conda env.

Fix: capture the output of command -v python3 BEFORE entering bulker exec,
then pass that absolute path through to the exec'd command. Python locates
its own site-packages from sys.prefix based on executable path, so
the conda-env python keeps its installed packages regardless of how
it's invoked.

Also adds an early-exit ERROR if no python3 is on PATH at all
(otherwise the failure mode is a confusing No module named pytest
inside bulker exec several seconds later), and prints the resolved
python path in the runner banner so it's clear which interpreter the
tests are running under.
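The point about sys.prefix can be checked directly. The sketch below is a minimal illustration rather than part of the test runner: it asks an interpreter, addressed by absolute path, which prefix it resolves for itself. The function name `describe_interpreter` is hypothetical.

```python
import subprocess
import sys


def describe_interpreter(python_path):
    """Run the interpreter at python_path and return the sys.prefix it
    reports, which is where it looks for its own site-packages."""
    result = subprocess.run(
        [python_path, "-c", "import sys; print(sys.prefix)"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Report the prefix of the interpreter running this script.
    print(sys.executable, "->", describe_interpreter(sys.executable))
```

Passing the absolute path of a conda-env python to this function from any shell should report that env's directory, independent of the PATH ordering in effect when it is called.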
Commit 598aa3b captured command -v python3 before entering bulker
exec, on the assumption that the caller's active env would be first
on PATH. That assumption holds on developer machines where
conda activate is the last PATH-mutating step. On HPC nodes where
a module load miniforge fires AFTER conda activate pepatac-env,
the module's PATH prepend buries the conda env's bin/ behind miniforge
base -- command -v python3 returns miniforge (no pytest installed)
even though the prompt says (pepatac-env) and CONDA_PREFIX points at
the env.

Rather than picking a single env var or PATH lookup as authoritative,
walk a small candidate list and pick the FIRST python that actually
imports pytest:

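    # Variable names are reconstructed: the commit text shows only their
    # expanded values. VIRTUAL_ENV in particular is an assumption.
    PYTHON_CANDIDATES=()
    [ -n "$CONDA_PREFIX" ] && PYTHON_CANDIDATES+=("$CONDA_PREFIX/bin/python3")
    [ -n "$VIRTUAL_ENV" ]  && PYTHON_CANDIDATES+=("$VIRTUAL_ENV/bin/python3")
    PYTHON_CANDIDATES+=("$(command -v python3)")
    PYTHON_CANDIDATES+=("$(command -v python)")

    # Keep the first candidate that is executable and can import pytest.
    for candidate in "${PYTHON_CANDIDATES[@]}"; do
        [ -x "$candidate" ] || continue
        if "$candidate" -c "import pytest" >/dev/null 2>&1; then
            ACTIVE_PYTHON="$candidate"
            break
        fi
    done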

If nothing in the list imports pytest, the script errors out with the
exact list of candidates that were tried -- a clearer failure than the
previous No module named pytest buried inside bulker exec output.
- tools/pepatac_summarizer/: Python package with CLI, consensus
  peak calling via gtars, peak counts, and summary plots
- pipelines/pepatac_collator.py: --summarizer python|R dispatch,
  defaults to python
- Remove obsolete PEPATACr R tests
- tests/test_summarizer.py: unit tests
- tests/test_summarizer_integration.py: integration tests
- Add plot_tss_distance using TssIndex.from_regionset.calc_tss_distances;
  wire into pepatac.py anno block (replaces R placeholder)
- Fix plot_frif to sum read counts from bedtools coverage outputs
- Reorder plot_partition_distribution to horizontal stacked bar with
  inline percent labels; add natural chrom sort + canonical chrom filter
- Add fragment-distribution median; add chrom/tssdist/part/frif CLI subcommands
- Align PartitionList.from_gtf defaults to R's GenomicDistributions:
  core_prom=100, prox_prom=2000 (was 2000/10000)
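The natural chromosome sort and canonical chromosome filter mentioned above could look something like the following. This is a sketch under stated assumptions, not the summarizer's implementation: it assumes UCSC-style names (chr1..chrN, chrX, chrY, chrM), and all function names are hypothetical.

```python
import re

# Assumption: canonical means chr + number, or chrX / chrY / chrM.
CANONICAL = re.compile(r"^chr(\d+|X|Y|M)$")


def is_canonical(chrom):
    """True for canonical names; False for alt, random, or unplaced contigs."""
    return CANONICAL.match(chrom) is not None


def natural_chrom_key(chrom):
    """Sort key: numbered chromosomes in numeric order, then X, Y, M."""
    label = chrom[3:] if chrom.startswith("chr") else chrom
    if label.isdigit():
        return (0, int(label), "")
    order = {"X": 0, "Y": 1, "M": 2}
    return (1, order.get(label, 3), label)


def canonical_sorted(chroms):
    """Drop non-canonical names, then sort the rest naturally."""
    return sorted(filter(is_canonical, chroms), key=natural_chrom_key)
```

A plain string sort would place chr10 before chr2; the numeric key avoids that, and the filter keeps contigs such as chrUn_* out of the plot.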
jpsmith5 requested a review from nsheff May 28, 2026 13:08