VirProf identifies viral infections in RNA-Seq data and produces full-length, phylogeny-quality viral genomes. While it was designed primarily to detect respiratory viral infections, it will also find commensal microbial gene sequences and other viruses.
Here is an example report generated by virprof: test.summary.xlsx
- Create a test directory and download the container image

  mkdir virprof_test
  cd virprof_test
  apptainer pull library://epruesse/default/virprof
- Initialize the test directory with reference data and configuration files

  ./virprof_latest.sif init-test --force .

  This command will install:

  - test_data/: a folder containing a very tiny example dataset, including the data for the "hg38gtest" reference dataset
  - databases/nt: a tiny BLAST database (NOT actually NT)
  - ymp.yml: YMP basic configuration file
  - config.yml: YMP config with VirProf parameters
  - respiratory_virus_whitelist.txt: default respiratory virus whitelist
  These files are enough to do a test run of the pipeline. If you are planning to analyze actual data, you will need to replace databases/nt with a link to the location of your actual copy of the NCBI NT BLAST database, for example as sketched below.
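  A minimal sketch of that replacement (the source path is a placeholder for wherever your NT copy lives):

  rm -r databases/nt                    # remove the tiny test database
  ln -s /path/to/your/nt databases/nt   # link in your real NT copy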
- Run VirProf

  ./virprof_latest.sif ymp make test.ref_hg38gtest.rnaseq_salmon.pathogen --ri -j32

  Anatomy of this command:

  - ymp make initiates a local build, using ymp to run snakemake.
  - test.ref_hg38gtest.rnaseq_salmon.pathogen defines the YMP pipeline. YMP uses a stack of stages to process data. We begin with the project stage test, which just makes its files available. Then we do the same for the hg38gtest reference, making it available. rnaseq_salmon is itself a pipeline comprising a number of stages. Lastly, building on all of those results, we run the pathogen pipeline. The outputs of each stage will be in folders named according to the stage stack at the end of the run.
  - The --ri option tells YMP to instruct Snakemake to "rerun incomplete" jobs. That is, if you abort a run, it will re-calculate files that were not completed, rather than complaining about them.
  - The -j32 option sets the maximum number of threads to 32.
  You could also run this in parts: first test.ref_hg38gtest.rnaseq_salmon, generating only the RNA-Seq outputs, and then test.ref_hg38gtest.rnaseq_salmon.pathogen, which builds on the results of the first invocation and generates the pathogen results. A sketch of this follows.
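  For example (same options as in the single command above; only the stage stack changes):

  # step 1: run only the RNA-Seq preprocessing
  ./virprof_latest.sif ymp make test.ref_hg38gtest.rnaseq_salmon --ri -j32
  # step 2: reuse those results and add the pathogen analysis
  ./virprof_latest.sif ymp make test.ref_hg38gtest.rnaseq_salmon.pathogen --ri -j32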
- Check out the results

  $ ls -1 reports
  test.salmon_sa.0.9.0.gene_counts.2025-05-15_18-15.rds
  test.salmon_sa.0.9.0.genomes.2025-05-15_18-19
  test.salmon_sa.0.9.0.multiqc_report.2025-05-15_18-15.html
  test.salmon_sa.0.9.0.multiqc_report_data.2025-05-15_18-15
  test.salmon_sa.0.9.0.pathogens.2025-05-15_18-19.rds
  test.salmon_sa.0.9.0.pathogens.2025-05-15_18-19.xlsx
  test.salmon_sa.0.9.0.stats.2025-05-15_18-15.rds
  test.salmon_sa.0.9.0.tx_counts.2025-05-15_18-15.rds
  test.salmon_sa.0.9.0.vp_multiqc_report.2025-05-15_18-19.html
  test.salmon_sa.0.9.0.vp_multiqc_report_data.2025-05-15_18-19

  At the end of a run, VirProf will copy the results out to the reports folder. This is to avoid overwriting results on subsequent runs. Each file has the following format:

  <project-name>.<pipeline-slug>.<version>.<result-name>.<date>.<ext>

  The <project-name> is the dataset you configured. The <pipeline-slug> is the type of preprocessing pipeline you chose, e.g. salmon_sa for using rnaseq_salmon, meaning Salmon in Selective Alignment mode. The <result-name> options are as follows:
  - gene_counts: RDS-format R file containing a SummarizedExperiment object with the counts as estimated by Salmon, aggregated to gene level. You can use this with e.g. DESeq2 directly:

      library(DESeq2)
      se <- readRDS('test.salmon_sa.0.9.0.gene_counts.2025-05-15_18-15.rds')
      dds <- DESeqDataSet(se, design = ~1)

  - genomes: A folder with the multi-FASTA files containing the recovered genomes. Files are aggregated per species.
  - multiqc_report: Results from MultiQC - the data files and the HTML report are both copied out.
  - pathogens: The reports from the VirProf pathogen analysis. Two versions are copied out: the rds is more complete, to be used from R, while the xlsx gives an easy overview of your dataset.
  - stats: A subset of the gene_counts RDS, containing just the QC metadata, for use with plotting scripts and across many datasets.
  - tx_counts: The same format as gene_counts, but contains the original per-transcript counts before aggregation.
  - vp_multiqc_report: The same as the multiqc_report, but run after the pathogen pipeline completes.
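  As an example of the naming scheme, the pathogens spreadsheet from the listing above breaks down as:

  test.salmon_sa.0.9.0.pathogens.2025-05-15_18-19.xlsx
    test              -> <project-name>
    salmon_sa         -> <pipeline-slug>  (rnaseq_salmon, Salmon in SA mode)
    0.9.0             -> <version>
    pathogens         -> <result-name>
    2025-05-15_18-19  -> <date>
    xlsx              -> <ext>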
- Install from Bioconda
TBD - packages to be uploaded very soon.
- Install from GitHub
TBD - also coming soon. Check containers/virprof.def for how it's
done for the container.
Most reference data needed by VirProf is installed automatically. Only the BLAST NT database needs to be installed manually. This is because of its size (70 GB+ download, 100 GB+ unpacked) and the high likelihood that your system already has a copy somewhere.
The Bioconda blast package comes with update_blastdb.pl, which automates downloading and unpacking NT. If you are using YMP, you could, for example, do this:
cd /path/to/your/new/NT/folder
ymp env run blast -- update_blastdb.pl --decompress nt
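If you are not using YMP, the same can be done from a plain Bioconda environment (a sketch; the environment name "blast" is arbitrary):

conda create -n blast -c conda-forge -c bioconda blast
conda activate blast
cd /path/to/your/new/NT/folder
update_blastdb.pl --decompress nt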
Once you have an NT database, all you need to do is point VirProf to it using a symlink:
cd databases
ln -s /path/to/your/new/NT/folder nt
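To check that the link resolves to a usable database, you can ask BLAST for a summary (a sketch; this assumes the database files inside the linked folder are named nt):

blastdbcmd -db nt/nt -info   # run from inside databases/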
- Prepare a sample sheet in CSV, TSV or Excel format listing your samples

  Each row should have a unit name, a potentially identical sample name, and the locations of the forward and reverse reads. The column containing the sample names must be named sample; the other columns can have arbitrary names. These columns will end up in your RDS output, including any additional, arbitrary columns you add to the sample sheet.

  unit  sample  fq1                                          fq2
  A     A       00000001_S7_L002_R1_001.fastq.gz             00000001_S7_L002_R2_001.fastq.gz
  B     B       00000002_S1_L002_R1_001.fastq.gz             00000002_S1_L002_R2_001.fastq.gz
  A2    A       other_run/00000123_S29_L001_R1_001.fastq.gz  other_run/00000123_S29_L001_R2_001.fastq.gz

  Paths are relative to the location of the sample sheet file (unless they start with a /). You can also list SRA run identifiers directly:

  sample  srr
  A       SRR123123123
  B       ERR456456456
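  Saved as CSV, the first sample sheet above would look like this:

  unit,sample,fq1,fq2
  A,A,00000001_S7_L002_R1_001.fastq.gz,00000001_S7_L002_R2_001.fastq.gz
  B,B,00000002_S1_L002_R1_001.fastq.gz,00000002_S1_L002_R2_001.fastq.gz
  A2,A,other_run/00000123_S29_L001_R1_001.fastq.gz,other_run/00000123_S29_L001_R2_001.fastq.gz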
- Create a YMP project for your samples

  To add this project to your VirProf configuration, add the following to the ymp.yml file:

  projects:
    name_of_my_project:
      data: path_to_my_sample_sheet.csv
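  With the project configured, its name becomes the first stage of your pipeline, e.g. (thread count and reference chosen as examples):

  ymp make -j32 name_of_my_project.ref_hg38g.rnaseq_salmon.pathogen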
VirProf was implemented with YMP. For all the details, refer to the YMP documentation.
VirProf implements the following pipelines:

- rnaseq_salmon: RNA-Seq using Salmon SA
- rnaseq_star_salmon: RNA-Seq using STAR (two pass) + Salmon (quant only)
- rnaseq_star_umi_salmon: RNA-Seq with STAR + Salmon and UMI dedup between mapping and quantification. Expects the UMI to already have been moved to the read name.
- pathogen: The actual VirProf pipeline. Run after one of the rnaseq_* pipelines, as those are used for initial host depletion.
- virus: The same as pathogen, but does not use the filter_trf stage to remove tandem repeat fragments. Without TRF filtering, non-viral results are obscured by host TRF reads, leading to spurious detections on anything in NT that has TRF sequences.
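These pipelines are combined into a stage stack on the command line. A sketch of swapping in the STAR-based preprocessing and the virus variant (project name and thread count are examples):

ymp make -j64 myproject.ref_hg38g.rnaseq_star_salmon.virus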
It also comes with some references:

- hg38g: The Gencode 38 reference (uses GRCh38, but Gencode 38 is a different 38)
- hg38gtest: A tiny test version that only includes a single gene.
- mm39g: The Gencode M27 reference (uses GRCm39)
- NcbiTaxonomy: Downloads the NCBI taxonomy
- VHDC: Downloads the VirusHostDB from genome.jp
A typical run would look like this:
ymp make -j64 myproject.ref_hg38g.rnaseq_salmon.pathogen
(Replace 64 with the number of threads you want to use).
To run the pipeline on a local SLURM cluster, just run ymp init cluster once to create the basic configuration. Then run
ymp submit <pipeline>
The ymp submit process keeps running for the duration of the pipeline. Depending on how much compute you are allowed to use on the login nodes, you may need to submit this controlling process as a cluster job as well. It can be small, as it is only used for simple steps such as creating symlinks; however, the underlying Snakemake does sometimes decide to use a lot of CPU.
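A minimal sketch of wrapping that controlling process in its own SLURM job (resource values are placeholders; adjust to your site and pipeline):

sbatch --cpus-per-task=4 --mem=8G --time=2-00:00:00 \
    --wrap "ymp submit myproject.ref_hg38g.rnaseq_salmon.pathogen"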
TBD