First pass at adding bamslice by mattsoup · Pull Request #62 · nebiolabs/EM-seq

mattsoup · 2026-06-30T17:53:16Z

No description provided.

Copilot

Pull request overview

This PR introduces uBAM byte-range chunking via bamslice to parallelize trimming, and then merges per-chunk fastp JSON reports back into a single per-library fastp.json for downstream aggregation. It updates the workflow wiring and nf-test expectations to reflect chunk-named outputs.

Changes:

- Replace fastp `--split_by_lines` chunking with bamslice-based uBAM byte-range chunking (`params.bamslice_chunk_size`) and run fastp per slice.
- Add a new `mergeFastpJson` module to merge per-slice fastp JSON outputs into `${library}.fastp.json`.
- Update nf-test assertions and snapshots for new chunk-derived filenames and metrics.
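To illustrate what merging per-slice fastp reports involves, here is a minimal Python sketch. It is not the code in `modules/merge_fastp_json.nf`; the function name `merge_fastp_reports` is hypothetical, and the sketch assumes only the additive counters under `summary` and `filtering_result` in fastp's JSON report need combining. Rates are recomputed from the merged counts rather than averaged, since chunks can differ in size.

```python
# Hypothetical sketch: combine per-chunk fastp JSON reports (already parsed
# into dicts) into one per-library report. Only additive counters are handled.

# Counters in fastp's "summary" section that can be summed across chunks.
SUMMARY_COUNTS = ("total_reads", "total_bases", "q20_bases", "q30_bases")
STAGES = ("before_filtering", "after_filtering")


def merge_fastp_reports(reports):
    """Sum additive counters from several fastp report dicts."""
    merged = {
        "summary": {stage: {key: 0 for key in SUMMARY_COUNTS} for stage in STAGES},
        "filtering_result": {},
    }
    for report in reports:
        for stage in STAGES:
            stage_counts = report.get("summary", {}).get(stage, {})
            for key in SUMMARY_COUNTS:
                merged["summary"][stage][key] += stage_counts.get(key, 0)
        for key, value in report.get("filtering_result", {}).items():
            merged["filtering_result"][key] = (
                merged["filtering_result"].get(key, 0) + value
            )
    # Recompute rates from the merged totals instead of averaging chunk rates.
    for stage in STAGES:
        counts = merged["summary"][stage]
        bases = counts["total_bases"]
        counts["q20_rate"] = counts["q20_bases"] / bases if bases else 0.0
        counts["q30_rate"] = counts["q30_bases"] / bases if bases else 0.0
    return merged
```

In a real pipeline the input dicts would come from `json.load` on each per-chunk report. Fields that are not simple sums (for example per-cycle quality curves or insert-size histograms) would need their own merge rules and are left out of this sketch.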

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `main.nf` | Builds uBAM byte-range chunks, runs `fastp` per chunk, merges fastp JSON, and feeds merged JSON into aggregation. |
| `modules/fastp.nf` | Switches stdin source from `samtools fastq` to `bamslice` with explicit start/end offsets; emits per-chunk JSON/FASTQs. |
| `modules/merge_fastp_json.nf` | New process to merge per-chunk fastp JSONs into a single per-library JSON for reporting/aggregation. |
| `nextflow.config` | Replaces `fastq_split_lines` with `bamslice_chunk_size` default and comment describing the new behavior. |
| `tests/main.nf.test` | Updates expected output filenames to match offset-based chunk naming and configures `bamslice_chunk_size` for tests. |
| `tests/main.nf.test.snap` | Updates stored snapshots to match new task graph and output content/metrics. |

+        // Split each uBAM into byte-range chunks so trimming runs in parallel per chunk.
+        chunk_size = params.bamslice_chunk_size as long
+        bam_chunks = passed_bams.flatMap { library, bam ->
+            def file_size = bam.size()
+            def chunks = []
+            for (long start = 0; start < file_size; start += chunk_size) {
+                long end = Math.min(start + chunk_size, file_size)
+                chunks << tuple(library, start, end)
            }
+            chunks
        }
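The chunking arithmetic in this hunk can be checked in isolation. The following is a standalone Python sketch of the same loop, not part of the PR. The helper names `byte_range_chunks` and `chunk_name` are hypothetical, the `library_start_end` naming pattern is inferred from the test filename `emseq-test1_0_250000`, and the 600000-byte file size in the usage note is made up for illustration.

```python
# Hypothetical sketch mirroring the flatMap loop: split a file of
# `file_size` bytes into half-open [start, end) ranges of at most
# `chunk_size` bytes each.

def byte_range_chunks(file_size, chunk_size):
    """Return (start, end) byte offsets covering the whole file."""
    if chunk_size <= 0:
        # A non-positive step would never advance past the start offset.
        raise ValueError("chunk_size must be a positive number of bytes")
    return [
        (start, min(start + chunk_size, file_size))
        for start in range(0, file_size, chunk_size)
    ]


def chunk_name(library, start, end):
    """Build a chunk label; the pattern is an assumption based on test outputs."""
    return f"{library}_{start}_{end}"


if __name__ == "__main__":
    for start, end in byte_range_chunks(600000, 250000):
        print(chunk_name("emseq-test1", start, end))
```

With a 600000-byte input and a 250000-byte chunk size, the ranges are (0, 250000), (250000, 500000), and (500000, 600000): the last chunk is shorter because `end` is clamped to the file size. The explicit check on `chunk_size` is a choice made in this sketch; the loop in the hunk does not show such a guard.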


@@ -30,7 +30,7 @@ nextflow_pipeline {
            def alignment_metrics = path("${launchDir}/test_output/stats/picard_alignment_metrics/emseq-test1.alignment_summary_metrics.txt").text.tokenize('\n')[5..8]
            def methyldackel_extract = path("${launchDir}/test_output/methylDackelExtracts/emseq-test1_CpG.methylKit.gz").md5
            def mbias = path("${launchDir}/test_output/methylDackelExtracts/mbias/emseq-test1.combined_mbias.tsv").md5
-            def nonconverted = path("${launchDir}/test_output/bwameth_align/0001.emseq-test1.nonconverted_counts.tsv").text.tokenize('\n')
+            def nonconverted = path("${launchDir}/test_output/bwameth_align/emseq-test1_0_250000.nonconverted_counts.tsv").text.tokenize('\n')


@@ -96,7 +96,7 @@ nextflow_pipeline {
            def alignment_metrics = path("${launchDir}/test_output/stats/picard_alignment_metrics/emseq-test1.alignment_summary_metrics.txt").text.tokenize('\n')[5..8]
            def methyldackel_extract = path("${launchDir}/test_output/methylDackelExtracts/emseq-test1_CpG.methylKit.gz").md5
            def mbias = path("${launchDir}/test_output/methylDackelExtracts/mbias/emseq-test1.combined_mbias.tsv").md5
-            def nonconverted = path("${launchDir}/test_output/bwameth_align/0001.emseq-test1.nonconverted_counts.tsv").text.tokenize('\n')
+            def nonconverted = path("${launchDir}/test_output/bwameth_align/emseq-test1_0_250000.nonconverted_counts.tsv").text.tokenize('\n')


lnblum

Looks good to me. I wonder if it's worth giving fastp more threads? If I recall, 1 thread was required to avoid a bug with `--split_by_lines` (the `fastq_split_lines` setting), which is no longer used.

First pass at adding bamslice

3303f07

Copilot AI review requested due to automatic review settings June 30, 2026 17:53

Copilot started reviewing on behalf of mattsoup June 30, 2026 17:53

Copilot AI reviewed Jun 30, 2026

Seq Shepherd added 2 commits July 2, 2026 12:43

Updating bamslice to include adapter dimers, plus corresponding fastp…

b007616

… release

More useful comment

9a0b448

mattsoup assigned lnblum and bwlang Jul 2, 2026

mattsoup requested review from bwlang and lnblum July 2, 2026 17:06

mattsoup added 2 commits July 2, 2026 13:08

Updating snapshot to account for new fastp version

3701b9c

Updating snapshot from the cluster rather than my laptop

2517bed

lnblum approved these changes Jul 2, 2026

mattsoup wants to merge 5 commits into master from bamslice

4 participants