Skip to content

seiboldlab/virprof

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Unit Tests

VirProf - Virus Identification, Quantification and Genome Recovery from RNA-Seq NGS data

Virprof identifies viral infections and in RNA-SEQ data and produces full-length phylogenty quality viral genomes. While it was designed primarily to detect respiratory viral infections, it will find commensal microbial gene sequences and other viruses as well.

Here is an example report generated by virprof: test.summary.xlsx

I. Testing VirProf using Apptainer or Singularity for the impatient

  1. Create a test directory and download the container image

    mkdir virprof_test
    cd virprof_test
    apptainer pull library://epruesse/default/virprof
  2. Initialize the test directory with reference data and configuration files

    ./virprof_latest.sif init-test --force .
    

    This command will install

    • test_data/: a folder containing a very tiny example dataset, including the data for the "hg38test" reference dataset.
    • databases/nt: a tiny BLAST database (NOT actually NT)
    • ymp.yml: YMP basic configuration file
    • config.yml: YMP config with VirProf parameters
    • respiratory_virus_whitelist.txt: Default resp. vir. whitelist

    These files are enough to do a test-run of the pipeline. If you are planning to analyze actual data, you will need to replace databases/nt with a link to the location of your actual copy of the NCBI NT BLAST database.

  3. Run VirProf

    ./virprof_latest.sif ymp make test.ref_hg38gtest.rnaseq_salmon.pathogen --ri -j32
    

    Anatomy of this command:

    • ymp make initiates a local build, using ymp to run snakemake.
    • test.ref_hg38gtest.rnaseq_salmon.pathogen defines the YMP pipeline. YMP uses a stack of stages to process data. We begin with the project stage test, which just makes its files available. Then we do the same for the hg38gtest reference, making it available. rnaseq_salmon is a pipeline itself comprising a number of stages. Lastly, we run based on all of those results the pathogen pipeline. The outputs of each stage will be in folders named according to the stage stack at the end of the run.
    • The --ri option instructs YMP to instruct Snakemake to "rerun interrupted" jobs. That is, if you abort a run, it will re-calculate files that were not completed, rather than complaining about them.
    • The -j32 option sets the maximum number of threads to 32.

    You could also run this in parts, using first test.ref_hg38gtest.rnaseq_salmon, generating only the RNA-Seq outputs, and then test.ref_hg38gtest.rnaseq_salmon.pathogen to build on the results of the first invocation, generating the pathogen results.

  4. Check out the results

    $ ls -1 reports
    test.salmon_sa.0.9.0.gene_counts.2025-05-15_18-15.rds
    test.salmon_sa.0.9.0.genomes.2025-05-15_18-19
    test.salmon_sa.0.9.0.multiqc_report.2025-05-15_18-15.html
    test.salmon_sa.0.9.0.multiqc_report_data.2025-05-15_18-15
    test.salmon_sa.0.9.0.pathogens.2025-05-15_18-19.rds
    test.salmon_sa.0.9.0.pathogens.2025-05-15_18-19.xlsx
    test.salmon_sa.0.9.0.stats.2025-05-15_18-15.rds
    test.salmon_sa.0.9.0.tx_counts.2025-05-15_18-15.rds
    test.salmon_sa.0.9.0.vp_multiqc_report.2025-05-15_18-19.html
    test.salmon_sa.0.9.0.vp_multiqc_report_data.2025-05-15_18-19
    

    At the end of a run, VirProf will copy the results out to the reports folder. This is to avoid overwriting results on subsequent runs. Each file has the following format:

    <project-name>.<pipeline-slug>.<version>.<result-name>.<date>.<ext>
    

    The <project-name> is the dataset you configured. The <pipeline-slug> is the type of preprocessing pipeline you chose, e.g. salmon_sa for using rnaseq_salmon, meaning Salmon in Selective Alignment mode. The <result-name> options are as follows:

    • gene_counts: RDS format R file containing a SummarizedExperiment object with the counts as estimated by Salmon aggregated to gene level. You can use this with e.g. DESeq2 directly:

      se <- readRDS('test.salmon_sa.0.9.0.gene_counts.2025-05-15_18-15.rds')
      dds <- DESeqDataSet(se, design = ~1)
      
    • genomes: This is a folder with the multi-FASTA files containing the recovered genomes. Files are aggregated per species.

    • multiqc_report: Results from MultiQC - the data files and the html report are both copied out.

    • pathogens: These are the reports from VirProf pathogen analysis. Two versions are copied out, the rds is more complete, to be used from R, while the xlsx gives an easy overview of your dataset.

    • stats: This is a subset of the genome_counts RDS, containing just the QC metadata for use with plotting scripts and across many datasets.

    • tx_counts: This is the same format as gene_counts, but contains the original per-transcript counts before agregation.

    • vp_multiqc_report: This is the same as the multiqc_report, but run after the pathogen pipeline completes.

II. Installing VirProf

  1. Install from Bioconda

TBD - packages to be uploaded very soon.

  1. Install from GitHub

TBD - also coming soon. Check containers/virprof.def for how it's done for the container.

III. Install Reference data

Most reference data needed by Virprof is installed automatically. Only the BLAST NT database needs to be installed manually. This is because of the size (70GB+ download, 100GB+ unpacked) and the high likelihood of your system already possessing a copy somwehere.

The Bioconda blast package comes with update_blastdb.pk, which automates downloading and unpacking NT. If using YMP, you could e.g. do this:

cd /path/to/your/new/NT/folder
ymp env run blast -- update_blastdb.pl --decompress nt

Once you have a NT database, all you need to do is point VirProf to it using a symlink:

cd databases
ln -s /path/to/your/new/NT/folder nt

IV. Configuring Samples

  1. Prepare a sample sheet in CSV, TSV or Excel format listing your samples.

    Each row should have a unit name, a potentially identical sample name and the locations of the forward and reverse reads. The name of the column with the sample names must be sample, the other columns can have arbitrary names. These columns will end up in your RDS output, including any additional, arbitrary, columns you add to the sample sheet.

    unit sample fq1 fq2
    A A 00000001_S7_L002_R1_001.fastq.gz 00000001_S7_L002_R2_001.fastq.gz
    B B 00000002_S1_L002_R1_001.fastq.gz 00000002_S1_L002_R2_001.fastq.gz
    A2 A other_run/00000123_S29_L001_R1_001.fastq.gz other_run/00000123_S29_L001_R2_001.fastq.gz

    Paths are relative to the location of the sample sheet file (unless they start with a /).

    You can also list SRA run identifiers directly:

    sample srr
    A SRR123123123
    B ERR456456456
  2. Create a YMP project for your samples

    To add this project to your Virprof configuration, add the following to the ymp.yml file:

    projects:
      name_of_my_project:
         data: path_to_my_sample_sheet.csv

III. Running Virprof

Virprof was implemented with YMP. For all the details, refer to their documentation.

VirProf implements the following pipelines:

  • rnaseq_salmon: RNA-Seq using Salmon SA
  • rnaseq_star_salmon: RNA-Seq using STAR (two pass) + Salmon (quant only)
  • rnaseq_star_umi_salmon: RNA-Seq with STAR+Salmon and UMI-Dedup between mapping and quantification. Expects the UMI to already have been moved to the read name.
  • pathogen: The actual VirProf pipeline. Run after one of the rnaseq_* pipelines as those are used for initial host depletion.
  • virus: The same as pathogen, but does not use the filter_trf stage to remove tandem repeat fragments. Without TRF filtering, non-viral results are obscured by host TRF reads leading to spurious detections on anything in NT that has TRF sequences.

It also comes with some references:

  • hg38g: The Gencode 38 reference (uses GRCh38, but Gencode 38 is a different 38)
  • hg38gtest: A tiny test version that only includes a single gene.
  • mm39g: The Gencode 27 reference (uses GRCm39)
  • NcbiTaxonomy: Downloads the NCBI taxonomy
  • VHDC: Downloads the VirusHostDB from genome.jp

A typical run would look like this:

ymp make -j64 myproject.ref_hg38g.rnaseq_salmon.pathogen

(Replace 64 with the number of threads you want to use).

To run the pipeline on a local SLURM cluster, just run ymp init cluster once to create the basic configuration. Then run

ymp submit <pipeline>

The submit job will exist for the time of the pipeline run. Depending on how much compute you may use on the login nodes, you may need to submit this as well. This can be small, it is used only for simple steps such as creating symlinks. However, the underlying Snakemake does sometimes decide to use a lot of CPU.

IV. Citing Virprof

TBD

About

Pipeline for recovering and quantifying non-host genomes from RNA-seq

Resources

License

Stars

Watchers

Forks

Packages

No packages published