DIAG toolkit provides scripts for estimating Depth of Independent Amplicons (DIA) metrics from VCF files. It consists of a pre-processing script to extract relevant allele depths and a calculation script to compute metrics and generate visualization reports.
For bugs, suggestions, questions, collaborations etc. please contact MengDong Zhang
process_vcf.py: Parses a VCF file to extract allele depths for experiment samples at sites where the bulk sample is heterozygous.calc_DIA.py: Calculates DIA metrics, performs bootstrap analysis, and generates interactive HTML plots.simple_analyse.py: Calculates whole-genome DIA for multiple samples with default settings directly from a VCF file.
-
Clone the repository:
git clone https://github.com/doorholez/DIAG_toolkit.git cd DIAG_toolkit -
Install dependencies:
The scripts require Python 3 (>= 3.8) and the following libraries:
- numpy
- pandas
- scipy
- plotly
- natsort
You can install them using pip:
pip install numpy pandas scipy plotly natsort
This script takes a VCF file as input. It assumes the first sample in the VCF is the "Bulk" sample. It identifies heterozygous sites in this reference sample and extracts the corresponding allele depths for all subsequent "Experiment" samples.
Syntax:
python process_vcf.py [options] <input_vcf>Arguments:
input_vcf: Path to the input VCF file (supports.vcfand.vcf.gz).-p,--prefix: Prefix string to add to output filenames.-f,--filter: Filter string (e.g.,PASS). Only variants with this exact value in the FILTER column are processed.-R,--region: Limit processing to a specific genomic region (format:chrom:start-end).-e,--exclude_chroms: Space-separated list of chromosomes to exclude (e.g.,chrX chrY chrM). Excluding haploid chromosomes can improve the accuracy of DIA estimation.-d,--depth: Minimum depth threshold. Sites with total depth lower than this value (in either bulk or experiment) are skipped.
Output:
Generates one text file per experiment sample named {prefix}{SampleName}.txt.
The file format is tab-separated:
#chrom pos depth1 depth2
chr1 1000 20 15
...
This script takes the text file generated by process_vcf.py and performs the DIA calculation. You can generate your own input file if you have custom requirements. It splits the genome into regions, estimates allele frequencies and DIA for both whole genome and regions, and produces an interactive HTML report.
Syntax:
python calc_DIA.py [options] <input_file> [output_file]Arguments:
input_file: Path to the input text file. See Input File Format below.output_file: (Optional) Path for the output HTML report. Defaults to<input_filename>.html.-r,--region_size: Size of genomic regions to split for analysis (default: 10,000,000 bp).-b,--bootstrap: Number of bootstrap iterations for whole-genome estimation (default: 1000).-br,--bootstrap_region: Number of bootstrap iterations for split regions (default: 0.5 * global bootstrap).-q,--quiet: Quiet mode (suppresses standard output, keeps warnings).-s,--silent: Silent mode (suppresses all output).
The input file is a whitespace-separated text file. Lines starting with # are ignored. Each data line must contain 4 or 5 columns:
chrom: Chromosome name.pos: Genomic position.depth1: Depth of the first allele (e.g., Reference depth).depth2: Depth of the second allele (e.g., Alternative depth).ploidy: (Optional) Local ploidy. Defaults to 2 if omitted.
Example:
#chrom pos depth1 depth2 ploidy
chr1 1000 20 15 3
chr1 1050 10 12
Output:
An HTML file containing:
- Whole Genome DIA Bootstrap Distribution: Density distribution plot of bootstrap results.
- Copy Number & Allele Frequency: Genomic tracks showing estimated allele frequency and ploidy.
- Split Region DIA: Genomic track showing DIA values with 95% confidence intervals.
This script performs a simple whole-genome DIA analysis with default settings for all experiment samples in a VCF file directly. It skips region splitting and assumes a global ploidy (p=0.5), making it suitable for rapid quality control of multiple samples.
Syntax:
python simple_analyse.py [options] <input_vcf> [output_tsv]Arguments:
input_vcf: Path to the input VCF file (supports.vcfand.vcf.gz).output_tsv: (Optional) Path for the output TSV file. Defaults to<input_vcf_basename>_samples_eIA.tsv.-b,--bootstrap: Number of bootstrap iterations (default: 1000).-e,--exclude_chroms: Space-separated list of chromosomes to exclude (e.g.,chrX chrY chrM). Excluding haploid chromosomes can improve the accuracy of DIA estimation.
Output:
A TSV file containing DIA metrics for each sample:
sample_name DIA bootstrap_lower bootstrap_upper
Sample1 15.23 14.50 16.10
Sample2 8.45 8.10 8.90
# 1. Process the VCF file
# Extracts data for samples in 'variants.vcf.gz', filtering for 'PASS' variants,
# a minimum depth of 10 and exclude the haploid chromosome 'contig_C'.
python process_vcf.py test_data/variants.vcf.gz --prefix test_data/ --filter PASS --depth 10 --exclude_chroms contig_C
# 2. Calculate DIA for a specific sample
# Processes the output for 'Sample_3' sample generated in the previous step.
python calc_DIA.py test_data/Sample_3.txt test_data/Sample_3.html
# 3. Analyse the VCF file simply
# Perform a simple analysis for samples in 'variants.vcf.gz', filtering for 'PASS' variants,
# a minimum depth of 10 and exclude haploid chromosome 'contig_C'.
python simple_analyse.py test_data/variants.vcf.gz test_data/variants.tsv --exclude_chroms contig_CNote on Test Data: For privacy and security reasons, the sample data provided in the test_data/ directory has been truncated and censored. It is intended for demonstration only and does not contain any reasonable biology infomation.
If you have any bug report, suggestion or question about the project, feel free to
- File an issue
- Pull request
- Email me: doorholez@gmail.com
when using DIAG please cite:
- what can I say?
