scraps

scraps extracts mRNA polyadenylation sites from "TVN"-primed single-cell RNA-seq libraries at near-nucleotide resolution.

scraps (Single Cell RNA PolyA Site Discovery) is currently implemented as a Snakemake pipeline for 10X Genomics 3' end v2/3 libraries and for other platforms with a similar library structure, including Drop-seq, Microwell-seq, and BD Rhapsody. If a long Read1 is available (an estimated ~6% of SRA-deposited data, and an option when planning new experiments), positional information is calculated from paired realignment; otherwise, the less optimal anchored Read2 approach is used. scraps will eventually be expanded to analyze a range of RNA processing changes in single-cell RNA-seq data.

For additional discussion and use cases, please see the bioRxiv preprint.

Example usage
Configuration
Supported scRNA-seq platforms
Output
Setup
Dependencies
For Developers
Extended function

Example usage

scraps requires the following as input (defined in config.yaml):

10X Genomics 3' v2/3 single-cell FASTQs, or FASTQs from other platforms (with names ending in "_R1.fastq.gz" and "_R2.fastq.gz")
A STAR genome index (must be generated with STAR 2.7.4a or later)
Whitelist for cell barcodes (optional but recommended to speed up run time)
A featureCounts reference (SAF-formatted polya_db, hg38 and mm10 files are included in ref subdirectory)

Quick Start

Set up conda environment:

conda env create -f scraps_conda.yml
conda activate scraps_conda

Configure your samples in config.yaml under the SAMPLES section
Run the pipeline:

snakemake --configfile config.yaml --resources total_impact=5 --keep-going

Detailed Usage

To run test data, simply execute:

snakemake --snakefile Snakefile \
  --configfile config.yaml \
  --resources total_impact=5 \
  --keep-going

DAG steps illustration

submit jobs in cluster mode

Note: total_impact is set to 5 for each sample; change this value to control how many samples are processed in parallel.

Configuration

scraps uses two main configuration files for flexible pipeline setup:

config.yaml

Main pipeline configuration file containing:

DATA: Directory containing input FASTQ files
RESULTS: Output directory for pipeline results
STAR_INDEX: Path to STAR genome index directory
POLYA_SITES: PolyA database reference file (SAF format, provided in ref/)
DEFAULTS: Default chemistry and platform settings
- platform: Default sequencing platform (e.g., illumina, element, ultima)
- chemistry: Default chemistry type (e.g., chromiumV3, chromiumV2, dropseq)
- alignments: Which alignment modes to run ([R1, R2, paired])
SAMPLES: Per-sample configuration

Sample Configuration Example:

SAMPLES:
  sample_name:
    basename: sample-           # FASTQ file prefix
    platform: illumina          # Sequencing platform
    chemistry: chromiumV3       # Platform chemistry
    alignments:                 # Optional: override default alignments
      - R2
      - paired

chemistry.yaml

Platform and chemistry-specific parameters organized hierarchically. Each chemistry type (chromiumV3, chromiumV2, dropseq, microwellseq, bd, indrop) contains:

bc_whitelist: Path to barcode whitelist file (optional)
bc_cut: Adapter sequences for complex barcode extraction (optional)
Platform-specific settings: Nested configurations for different sequencing platforms
- cutadapt_R1 / cutadapt_paired: Adapter trimming parameters
- STAR_R1 / STAR_R2: STAR alignment parameters (UMI/barcode positions)

Configuration Hierarchy

The pipeline uses hierarchical configuration lookup to determine parameters for each sample:

┌─────────────────────────────────────────────────────────┐
│  1. Sample-specific settings (config.yaml SAMPLES)     │
│     Highest priority - overrides everything             │
└────────────────────┬────────────────────────────────────┘
                     │ If not found ↓
┌─────────────────────────────────────────────────────────┐
│  2. Chemistry + Platform (chemistry.yaml)               │
│     e.g., chromiumV3 → illumina → STAR_R1               │
└────────────────────┬────────────────────────────────────┘
                     │ If not found ↓
┌─────────────────────────────────────────────────────────┐
│  3. Chemistry defaults (chemistry.yaml)                 │
│     e.g., chromiumV3 → bc_whitelist                     │
└────────────────────┬────────────────────────────────────┘
                     │ If not found ↓
┌─────────────────────────────────────────────────────────┐
│  4. Global defaults (config.yaml DEFAULTS)              │
│     Lowest priority - fallback values                   │
└─────────────────────────────────────────────────────────┘

This allows platform-specific customization (e.g., Illumina vs Ultima Genomics) while maintaining chemistry-specific defaults.
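
The lookup order in the diagram can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the function name `lookup`, the nested layout of the chemistry dictionary, and the placeholder value for STAR_R1 are assumptions made for the example.

```python
def lookup(key, sample, config, chemistry_cfg):
    """Resolve `key` for one sample, checking the most specific level first.

    Hypothetical helper: the nesting chemistry -> platform -> parameter
    mirrors the diagram above but is an assumption about chemistry.yaml.
    """
    sample_cfg = config["SAMPLES"][sample]
    defaults = config.get("DEFAULTS", {})

    # Chemistry and platform themselves fall back to DEFAULTS.
    chem = sample_cfg.get("chemistry", defaults.get("chemistry"))
    plat = sample_cfg.get("platform", defaults.get("platform"))

    chem_cfg = chemistry_cfg.get(chem, {})
    plat_cfg = chem_cfg.get(plat, {})

    # 1. sample, 2. chemistry + platform, 3. chemistry, 4. global defaults
    for level in (sample_cfg, plat_cfg, chem_cfg, defaults):
        if isinstance(level, dict) and key in level:
            return level[key]
    return None


config = {
    "DEFAULTS": {
        "platform": "illumina",
        "chemistry": "chromiumV3",
        "alignments": ["R1", "R2", "paired"],
    },
    "SAMPLES": {
        "s1": {"basename": "s1_", "alignments": ["R2", "paired"]},
        "s2": {"basename": "s2_"},
    },
}
chemistry_cfg = {
    "chromiumV3": {
        "bc_whitelist": "ref/3M-february-2018.txt",
        "illumina": {"STAR_R1": "<placeholder STAR options>"},
    }
}

print(lookup("alignments", "s1", config, chemistry_cfg))    # sample-level override
print(lookup("alignments", "s2", config, chemistry_cfg))    # falls back to DEFAULTS
print(lookup("bc_whitelist", "s2", config, chemistry_cfg))  # chemistry-level default
```

Here s1 keeps its own alignments list, while s2 inherits the list from DEFAULTS and the whitelist from the chromiumV3 chemistry entry.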

Supported scRNA-seq platforms

Platform	Library (BC+UMI+A)	Setting	Test data
10x Chromium V3	16 + 12 + 30	chromiumV3	✓
10x V3 - Ultima Genomics	adapter + 16 + 9 + 3 ignored + 8	chromiumV3UG
10x Chromium V2	16 + 10 + 30	chromiumV2	✓
10x Chromium Visium	16 + 10 + 30	visium
Drop-seq	12 + 8 + 30	dropseq	✓
Microwell-seq	6x3 + 6 + 30	microwellseq	✓
BD Rhapsody	9x3 + 8 + 18	bd
inDrop	8 + 6 + 18	indrop

Custom chemistries are supported by editing chemistry.yaml. See also the synthetic FASTQ tool.
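
To make the "BC + UMI + A" column concrete, the sketch below splits a Read1 sequence into cell barcode, UMI, and the remaining bases for the simple contiguous layouts in the table. It is an illustration only: scraps itself delegates this to the tools configured in chemistry.yaml, and the names `LAYOUTS` and `split_read1` are hypothetical. Split-barcode layouts such as Microwell-seq and BD Rhapsody are not covered.

```python
# (barcode length, UMI length) taken from the table above.
LAYOUTS = {
    "chromiumV3": (16, 12),
    "chromiumV2": (16, 10),
    "dropseq": (12, 8),
}


def split_read1(seq, chemistry):
    """Return (barcode, umi, remainder) for a contiguous BC+UMI layout."""
    bc_len, umi_len = LAYOUTS[chemistry]
    if len(seq) < bc_len + umi_len:
        raise ValueError("Read1 is shorter than barcode + UMI")
    barcode = seq[:bc_len]
    umi = seq[bc_len:bc_len + umi_len]
    remainder = seq[bc_len + umi_len:]  # poly(T) stretch and anything after it
    return barcode, umi, remainder


# Made-up read: 16 nt barcode, 12 nt UMI, 30 nt of T.
read = "AACTCCCGTTCCTCCA" + "ACGTACGTACGT" + "T" * 30
barcode, umi, remainder = split_read1(read, "chromiumV3")
print(barcode, umi, len(remainder))
```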

Output

bedgraph : TVN-priming site pileup

chr11   215106  215107  1
chr11   689216  689217  1
chr11   812862  812863  1
chr11   812870  812871  2
chr11   812871  812872  2

count table : ±10 around PolyA_DB sites, by cell barcode

gene    cell    count
AC135178.2_NA_ENSG00000263809_chr17_8377523_-_Intron,RPL26_6154_ENSG00000161970_chr17_8377523_-_3'UTR(M)        AACTCCCGTTCCTCCA        1
AC135178.2_NA_ENSG00000263809_chr17_8377523_-_Intron,RPL26_6154_ENSG00000161970_chr17_8377523_-_3'UTR(M)        CCCATACGTTAAAGAC        1
AC135178.2_NA_ENSG00000263809_chr17_8377523_-_Intron,RPL26_6154_ENSG00000161970_chr17_8377523_-_3'UTR(M)        CGTCCATTCGACAGCC        1
ACTG1_71_ENSG00000184009_chr17_81509999_-_3'UTR(M)      ACATCAGGTGATGTCT        1
ADRM1_11047_ENSG00000130706_chr20_62308862_+_3'UTR(M)   CAGCGACTCTGCCCTA        1
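
As a quick sanity check outside R, the count table can be summarized with plain Python. The sketch below assumes the three whitespace-separated columns shown above (gene, cell, count) and sums counts per site across cells; the function name `site_totals` is hypothetical and the rows are copied from the sample output.

```python
from collections import Counter


def site_totals(lines):
    """Sum counts per PolyA site across all cell barcodes."""
    rows = iter(lines)
    header = next(rows).split()
    if header != ["gene", "cell", "count"]:
        raise ValueError(f"unexpected header: {header}")
    totals = Counter()
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        site, cell, count = line.split()
        totals[site] += int(count)
    return totals


example = [
    "gene\tcell\tcount",
    "ACTG1_71_ENSG00000184009_chr17_81509999_-_3'UTR(M)\tACATCAGGTGATGTCT\t1",
    "ADRM1_11047_ENSG00000130706_chr20_62308862_+_3'UTR(M)\tCAGCGACTCTGCCCTA\t1",
    "ADRM1_11047_ENSG00000130706_chr20_62308862_+_3'UTR(M)\tAACTCCCGTTCCTCCA\t1",
]
totals = site_totals(example)
for site, n in totals.most_common():
    print(n, site)
```

The third row is an invented second cell for the ADRM1 site, added so that the aggregation is visible.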

html report : various metrics from steps in the pipeline

R functions are available for importing results into a Seurat object and for finding differential PA site usage. Alternatively, a package containing the same functions can be installed with remotes::install_github("rnabioco/scrapR").

Setup

1. Clone repository

git clone https://github.com/rnabioco/scraps
cd scraps

2. Set up conda environment (recommended)

conda env create -f scraps_conda.yml
conda activate scraps_conda

Alternatively, ensure all dependencies are installed and available in your PATH.

3. Prepare reference files

STAR genome index

Place the STAR index in the ref/ directory, or specify a custom path in config.yaml (STAR_INDEX).

Download link (extract after download):

GRCh38 index

Barcode whitelists (optional but recommended)

Whitelist paths are configured per chemistry in chemistry.yaml. Place downloaded whitelists in the ref/ directory.

Download links (extract after download):

10x V2 barcodes → ref/737K-august-2016.txt
10x V3 barcodes → ref/3M-february-2018.txt

Update chemistry.yaml with the correct paths:

chromiumV3:
  bc_whitelist: ref/3M-february-2018.txt
chromiumV2:
  bc_whitelist: ref/737K-august-2016.txt

4. Configure your samples

Edit config.yaml to specify:

DATA: Path to directory containing FASTQ files (with naming pattern: *_R1.fastq.gz, *_R2.fastq.gz)
RESULTS: Output directory path
STAR_INDEX: Path to STAR genome index
POLYA_SITES: PolyA database reference (provided: ref/polyadb32.hg38.saf.gz or ref/polyadb32.mm10.saf.gz)
SAMPLES: Define each sample with:
- basename: FASTQ filename prefix
- chemistry: Platform chemistry type (chromiumV2, chromiumV3, dropseq, microwellseq, bd, indrop)
- platform: Sequencing platform (illumina, element, ultima)
- alignments: Optional list of alignment modes to run

Example:

SAMPLES:
  my_sample:
    basename: SRR9887775_        # Matches SRR9887775_R1.fastq.gz, SRR9887775_R2.fastq.gz
    chemistry: chromiumV3
    platform: illumina

Note: SRA accessions (e.g., SRR9887775) can be used directly as basenames for automatic download.
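
For locally stored FASTQs, it can help to confirm before a run that a basename resolves to both reads. The sketch below is a hypothetical helper, not part of scraps: it assumes that the basename is a filename prefix and that files end in _R1.fastq.gz and _R2.fastq.gz, as described above, and it uses a temporary directory with empty files for the demonstration.

```python
import tempfile
from pathlib import Path


def find_fastqs(data_dir, basename):
    """List R1 and R2 FASTQ file names in data_dir that start with basename."""
    names = sorted(
        p.name for p in Path(data_dir).iterdir() if p.name.startswith(basename)
    )
    return {
        read: [n for n in names if n.endswith(f"_{read}.fastq.gz")]
        for read in ("R1", "R2")
    }


# Demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "SRR9887775_R1.fastq.gz",
        "SRR9887775_R2.fastq.gz",
        "other_R1.fastq.gz",
    ):
        (Path(tmp) / name).touch()
    found = find_fastqs(tmp, "SRR9887775_")
    unmatched = find_fastqs(tmp, "missing_")

print(found)
print(unmatched)
```

An empty list for either read indicates that the basename or the DATA directory in config.yaml probably needs adjusting.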

5. Run the pipeline

# Dry-run to check configuration
snakemake -npr --configfile config.yaml

# Run pipeline
snakemake --configfile config.yaml --resources total_impact=5 --keep-going

# Or with specific core count
snakemake -j 8 --configfile config.yaml

Sample test results can be found at inst/test_output/

Dependencies

scraps requires the following executables in your PATH:

Python 3 (>= 3.7)
Snakemake (>= 5.3.0, < 8.0)
UMI-tools (>= 1.1.2)
cutadapt (>= 3.4)
STAR (>= 2.7.9a)
Samtools (>= 1.15)
Bedtools (>= 2.30.0)
Subread (>= 2.0.1)
MultiQC (>= 1.6)
pysam (>= 0.16.0)
drmaa (>= 0.7.9, for cluster execution)
zsh (shell used for pipeline execution)
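
When dependencies are managed by hand, a short script can report which command-line tools are not yet on PATH. This is a sketch under assumptions: the list uses the usual executable names for the packages above (umi_tools for UMI-tools, featureCounts for Subread, multiqc for MultiQC), it does not check versions, and Python packages such as pysam and drmaa are not covered.

```python
import shutil

# Usual executable names for the command-line dependencies listed above.
REQUIRED = [
    "snakemake",
    "umi_tools",
    "cutadapt",
    "STAR",
    "samtools",
    "bedtools",
    "featureCounts",
    "multiqc",
    "zsh",
]


def missing_executables(names=REQUIRED):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_executables()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed executables were found.")
```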

Recommended: Use Conda to manage these dependencies:

conda env create -f scraps_conda.yml
conda activate scraps_conda

All required dependencies (including zsh) will be installed automatically.

A Docker image for automated deployment is also available at https://hub.docker.com/r/rnabioco/scraps.

Please also see the Snakemake documentation for general information on executing and manipulating snakemake pipelines.

For Developers

For detailed development guidelines including code style conventions, testing procedures, and instructions for adding new rules or chemistry configurations, see AGENTS.md.

Key resources:

Code style: Python, R, and Snakemake conventions
Testing: Using Snakemake dry-run for validation
Adding rules: How to extend the pipeline
Chemistry config: Adding new platform support
Debugging: Common issues and solutions

Extended function

1) Measuring internal priming as indicator of apoptotic cytoplasmic poly(A) RNA decay

(Based on widespread RNA decay during apoptosis: Liu and Fu et al.) Use the SAF file marking all gene regions (5'UTR, intron, CDS, 3'UTR; an hg38 version is provided in the ref subdirectory), together with the helper R functions, to process the output. Please see the Rmarkdown notebook for more.

2) Accurate intron/exon quantification for RNA velocity

(See discussions on quantification approaches and pitfalls: Soneson et al.)

Consideration	scraps
Avoid feature double-counting	✓
Take strandedness into account	✓
Avoid count subtraction	✓
Resolve spliced vs unspliced target	✓
Speed	✓
