VirProf identifies viral infections in RNA-Seq data and produces full-length, phylogeny-quality viral genomes. While it was designed primarily to detect respiratory viral infections, it will also find commensal microbial gene sequences and other viruses.
Here is an example report generated by virprof: test.summary.xlsx
- Create a test directory and download the container image

  mkdir virprof_test
  cd virprof_test
  apptainer pull library://epruesse/default/virprof
- Initialize the test directory with reference data and configuration files

  ./virprof_latest.sif init-test --force .

  This command will install:

  - test_data/: a folder containing a very tiny example dataset, including the data for the "hg38gtest" reference dataset
  - databases/nt: a tiny BLAST database (NOT actually NT)
  - ymp.yml: YMP basic configuration file
  - config.yml: YMP config with VirProf parameters
  - respiratory_virus_whitelist.txt: default respiratory virus whitelist
  These files are enough to do a test run of the pipeline. If you are planning to analyze actual data, you will need to replace databases/nt with a link to the location of your actual copy of the NCBI NT BLAST database, for example as sketched below.
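  A minimal sketch of that replacement (the source path is a placeholder for wherever your NT copy lives):

  rm -r databases/nt                    # remove the tiny test database
  ln -s /path/to/your/nt databases/nt   # link in your real NT copy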
- Run VirProf

  ./virprof_latest.sif ymp make test.ref_hg38gtest.rnaseq_salmon.pathogen --ri -j32

  Anatomy of this command:

  - ymp make initiates a local build, using ymp to run snakemake.
  - test.ref_hg38gtest.rnaseq_salmon.pathogen defines the YMP pipeline. YMP uses a stack of stages to process data. We begin with the project stage test, which just makes its files available. Then we do the same for the hg38gtest reference, making it available. rnaseq_salmon is itself a pipeline comprising a number of stages. Lastly, building on all of those results, we run the pathogen pipeline. The outputs of each stage will be in folders named according to the stage stack at the end of the run.
  - The --ri option tells YMP to instruct Snakemake to "rerun incomplete" jobs. That is, if you abort a run, it will re-calculate files that were not completed, rather than complaining about them.
  - The -j32 option sets the maximum number of threads to 32.
  You could also run this in parts: first test.ref_hg38gtest.rnaseq_salmon, generating only the RNA-Seq outputs, and then test.ref_hg38gtest.rnaseq_salmon.pathogen, which builds on the results of the first invocation and generates the pathogen results. A sketch of this follows.
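  For example (same options as in the single command above; only the stage stack changes):

  # step 1: run only the RNA-Seq preprocessing
  ./virprof_latest.sif ymp make test.ref_hg38gtest.rnaseq_salmon --ri -j32
  # step 2: reuse those results and add the pathogen analysis
  ./virprof_latest.sif ymp make test.ref_hg38gtest.rnaseq_salmon.pathogen --ri -j32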
- Check out the results

  $ ls -1 reports
  test.salmon_sa.0.9.0.gene_counts.2025-05-15_18-15.rds
  test.salmon_sa.0.9.0.genomes.2025-05-15_18-19
  test.salmon_sa.0.9.0.multiqc_report.2025-05-15_18-15.html
  test.salmon_sa.0.9.0.multiqc_report_data.2025-05-15_18-15
  test.salmon_sa.0.9.0.pathogens.2025-05-15_18-19.rds
  test.salmon_sa.0.9.0.pathogens.2025-05-15_18-19.xlsx
  test.salmon_sa.0.9.0.stats.2025-05-15_18-15.rds
  test.salmon_sa.0.9.0.tx_counts.2025-05-15_18-15.rds
  test.salmon_sa.0.9.0.vp_multiqc_report.2025-05-15_18-19.html
  test.salmon_sa.0.9.0.vp_multiqc_report_data.2025-05-15_18-19

  At the end of a run, VirProf will copy the results out to the reports folder. This is to avoid overwriting results on subsequent runs. Each file has the following format:

  <project-name>.<pipeline-slug>.<version>.<result-name>.<date>.<ext>

  The <project-name> is the dataset you configured. The <pipeline-slug> is the type of preprocessing pipeline you chose, e.g. salmon_sa for using rnaseq_salmon, meaning Salmon in Selective Alignment mode. The <result-name> options are as follows:
  - gene_counts: RDS-format R file containing a SummarizedExperiment object with the counts as estimated by Salmon, aggregated to gene level. You can use this with e.g. DESeq2 directly:

      library(DESeq2)
      se <- readRDS('test.salmon_sa.0.9.0.gene_counts.2025-05-15_18-15.rds')
      dds <- DESeqDataSet(se, design = ~1)

  - genomes: A folder with the multi-FASTA files containing the recovered genomes. Files are aggregated per species.
  - multiqc_report: Results from MultiQC - the data files and the HTML report are both copied out.
  - pathogens: The reports from the VirProf pathogen analysis. Two versions are copied out: the rds is more complete, to be used from R, while the xlsx gives an easy overview of your dataset.
  - stats: A subset of the gene_counts RDS, containing just the QC metadata, for use with plotting scripts and across many datasets.
  - tx_counts: The same format as gene_counts, but contains the original per-transcript counts before aggregation.
  - vp_multiqc_report: The same as the multiqc_report, but run after the pathogen pipeline completes.
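  As an example of the naming scheme, the pathogens spreadsheet from the listing above breaks down as:

  test.salmon_sa.0.9.0.pathogens.2025-05-15_18-19.xlsx
    test              -> <project-name>
    salmon_sa         -> <pipeline-slug>  (rnaseq_salmon, Salmon in SA mode)
    0.9.0             -> <version>
    pathogens         -> <result-name>
    2025-05-15_18-19  -> <date>
    xlsx              -> <ext>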
- Install from Bioconda
TBD - packages to be uploaded very soon.
- Install from GitHub
TBD - also coming soon. Check containers/virprof.def for how it's
done for the container.
Most reference data needed by VirProf is installed automatically. Only the BLAST NT database needs to be installed manually. This is because of its size (70 GB+ download, 100 GB+ unpacked) and the high likelihood that your system already has a copy somewhere.
The Bioconda blast package comes with update_blastdb.pl, which automates downloading and unpacking NT. If you are using YMP, you could, for example, do this:
cd /path/to/your/new/NT/folder
ymp env run blast -- update_blastdb.pl --decompress nt
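If you are not using YMP, the same can be done from a plain Bioconda environment (a sketch; the environment name "blast" is arbitrary):

conda create -n blast -c conda-forge -c bioconda blast
conda activate blast
cd /path/to/your/new/NT/folder
update_blastdb.pl --decompress nt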
Once you have an NT database, all you need to do is point VirProf to it using a symlink:
cd databases
ln -s /path/to/your/new/NT/folder nt
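To check that the link resolves to a usable database, you can ask BLAST for a summary (a sketch; this assumes the database files inside the linked folder are named nt):

blastdbcmd -db nt/nt -info   # run from inside databases/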
- Prepare a sample sheet in CSV, TSV or Excel format listing your samples

  Each row should have a unit name, a potentially identical sample name, and the locations of the forward and reverse reads. The column containing the sample names must be named sample; the other columns can have arbitrary names. These columns will end up in your RDS output, including any additional, arbitrary columns you add to the sample sheet.

  unit  sample  fq1                                          fq2
  A     A       00000001_S7_L002_R1_001.fastq.gz             00000001_S7_L002_R2_001.fastq.gz
  B     B       00000002_S1_L002_R1_001.fastq.gz             00000002_S1_L002_R2_001.fastq.gz
  A2    A       other_run/00000123_S29_L001_R1_001.fastq.gz  other_run/00000123_S29_L001_R2_001.fastq.gz

  Paths are relative to the location of the sample sheet file (unless they start with a /). You can also list SRA run identifiers directly:

  sample  srr
  A       SRR123123123
  B       ERR456456456
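  Saved as CSV, the first sample sheet above would look like this:

  unit,sample,fq1,fq2
  A,A,00000001_S7_L002_R1_001.fastq.gz,00000001_S7_L002_R2_001.fastq.gz
  B,B,00000002_S1_L002_R1_001.fastq.gz,00000002_S1_L002_R2_001.fastq.gz
  A2,A,other_run/00000123_S29_L001_R1_001.fastq.gz,other_run/00000123_S29_L001_R2_001.fastq.gz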
- Create a YMP project for your samples

  To add this project to your VirProf configuration, add the following to the ymp.yml file:

  projects:
    name_of_my_project:
      data: path_to_my_sample_sheet.csv
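  With the project configured, its name becomes the first stage of your pipeline, e.g. (thread count and reference chosen as examples):

  ymp make -j32 name_of_my_project.ref_hg38g.rnaseq_salmon.pathogen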
VirProf was implemented with YMP. For all the details, refer to the YMP documentation.
VirProf implements the following pipelines:

- rnaseq_salmon: RNA-Seq using Salmon SA
- rnaseq_star_salmon: RNA-Seq using STAR (two pass) + Salmon (quant only)
- rnaseq_star_umi_salmon: RNA-Seq with STAR + Salmon and UMI dedup between mapping and quantification. Expects the UMI to already have been moved to the read name.
- pathogen: The actual VirProf pipeline. Run after one of the rnaseq_* pipelines, as those are used for initial host depletion.
- virus: The same as pathogen, but does not use the filter_trf stage to remove tandem repeat fragments. Without TRF filtering, non-viral results are obscured by host TRF reads, leading to spurious detections on anything in NT that has TRF sequences.
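These pipelines are combined into a stage stack on the command line. A sketch of swapping in the STAR-based preprocessing and the virus variant (project name and thread count are examples):

ymp make -j64 myproject.ref_hg38g.rnaseq_star_salmon.virus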
It also comes with some references:

- hg38g: The Gencode 38 reference (uses GRCh38, but Gencode 38 is a different 38)
- hg38gtest: A tiny test version that only includes a single gene.
- mm39g: The Gencode M27 reference (uses GRCm39)
- NcbiTaxonomy: Downloads the NCBI taxonomy
- VHDC: Downloads the VirusHostDB from genome.jp
A typical run would look like this:
ymp make -j64 myproject.ref_hg38g.rnaseq_salmon.pathogen
(Replace 64 with the number of threads you want to use).
To run the pipeline on a local SLURM cluster, just run ymp init cluster once to create the basic configuration. Then run
ymp submit <pipeline>
The ymp submit process keeps running for the duration of the pipeline. Depending on how much compute you are allowed to use on the login nodes, you may need to submit this controlling process as a cluster job as well. It can be small, as it is only used for simple steps such as creating symlinks; however, the underlying Snakemake does sometimes decide to use a lot of CPU.
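A minimal sketch of wrapping that controlling process in its own SLURM job (resource values are placeholders; adjust to your site and pipeline):

sbatch --cpus-per-task=4 --mem=8G --time=2-00:00:00 \
    --wrap "ymp submit myproject.ref_hg38g.rnaseq_salmon.pathogen"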
TBD