Skip to content

doorholez/DIAG_toolkit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DIAG: A standardized metric for WGA quality control

DIAG toolkit provides scripts for estimating Depth of Independent Amplicons (DIA) metrics from VCF files. It consists of a pre-processing script to extract relevant allele depths and a calculation script to compute metrics and generate visualization reports.
For bugs, suggestions, questions, collaborations etc. please contact MengDong Zhang

Scripts

  1. process_vcf.py: Parses a VCF file to extract allele depths for experiment samples at sites where the bulk sample is heterozygous.
  2. calc_DIA.py: Calculates DIA metrics, performs bootstrap analysis, and generates interactive HTML plots.
  3. simple_analyse.py: Calculates whole-genome DIA for multiple samples with default settings directly from a VCF file.

Installation

  1. Clone the repository:

    git clone https://github.com/doorholez/DIAG_toolkit.git
    cd DIAG_toolkit
  2. Install dependencies:

    The scripts require Python 3 (>= 3.8) and the following libraries:

    • numpy
    • pandas
    • scipy
    • plotly
    • natsort

    You can install them using pip:

    pip install numpy pandas scipy plotly natsort

Usage

1. Pre-processing (process_vcf.py)

This script takes a VCF file as input. It assumes the first sample in the VCF is the "Bulk" sample. It identifies heterozygous sites in this reference sample and extracts the corresponding allele depths for all subsequent "Experiment" samples.

Syntax:

python process_vcf.py [options] <input_vcf>

Arguments:

  • input_vcf: Path to the input VCF file (supports .vcf and .vcf.gz).
  • -p, --prefix: Prefix string to add to output filenames.
  • -f, --filter: Filter string (e.g., PASS). Only variants with this exact value in the FILTER column are processed.
  • -R, --region: Limit processing to a specific genomic region (format: chrom:start-end).
  • -e, --exclude_chroms: Space-separated list of chromosomes to exclude (e.g., chrX chrY chrM). Excluding haploid chromosomes can improve the accuracy of DIA estimation.
  • -d, --depth: Minimum depth threshold. Sites with total depth lower than this value (in either bulk or experiment) are skipped.

Output:

Generates one text file per experiment sample named {prefix}{SampleName}.txt. The file format is tab-separated:

#chrom  pos     depth1  depth2
chr1    1000    20      15
...

2. Calculation & Visualization (calc_DIA.py)

This script takes the text file generated by process_vcf.py and performs the DIA calculation. You can generate your own input file if you have custom requirements. It splits the genome into regions, estimates allele frequencies and DIA for both whole genome and regions, and produces an interactive HTML report.

Syntax:

python calc_DIA.py [options] <input_file> [output_file]

Arguments:

  • input_file: Path to the input text file. See Input File Format below.
  • output_file: (Optional) Path for the output HTML report. Defaults to <input_filename>.html.
  • -r, --region_size: Size of genomic regions to split for analysis (default: 10,000,000 bp).
  • -b, --bootstrap: Number of bootstrap iterations for whole-genome estimation (default: 1000).
  • -br, --bootstrap_region: Number of bootstrap iterations for split regions (default: 0.5 * global bootstrap).
  • -q, --quiet: Quiet mode (suppresses standard output, keeps warnings).
  • -s, --silent: Silent mode (suppresses all output).

Input File Format:

The input file is a whitespace-separated text file. Lines starting with # are ignored. Each data line must contain 4 or 5 columns:

  1. chrom: Chromosome name.
  2. pos: Genomic position.
  3. depth1: Depth of the first allele (e.g., Reference depth).
  4. depth2: Depth of the second allele (e.g., Alternative depth).
  5. ploidy: (Optional) Local ploidy. Defaults to 2 if omitted.

Example:

#chrom  pos     depth1  depth2  ploidy
chr1    1000    20      15      3
chr1    1050    10      12

Output:

An HTML file containing:

  1. Whole Genome DIA Bootstrap Distribution: Density distribution plot of bootstrap results.
  2. Copy Number & Allele Frequency: Genomic tracks showing estimated allele frequency and ploidy.
  3. Split Region DIA: Genomic track showing DIA values with 95% confidence intervals.

3. Simple Analysis (simple_analyse.py)

This script performs a simple whole-genome DIA analysis with default settings for all experiment samples in a VCF file directly. It skips region splitting and assumes a global ploidy (p=0.5), making it suitable for rapid quality control of multiple samples.

Syntax:

python simple_analyse.py [options] <input_vcf> [output_tsv]

Arguments:

  • input_vcf: Path to the input VCF file (supports .vcf and .vcf.gz).
  • output_tsv: (Optional) Path for the output TSV file. Defaults to <input_vcf_basename>_samples_eIA.tsv.
  • -b, --bootstrap: Number of bootstrap iterations (default: 1000).
  • -e, --exclude_chroms: Space-separated list of chromosomes to exclude (e.g., chrX chrY chrM). Excluding haploid chromosomes can improve the accuracy of DIA estimation.

Output:

A TSV file containing DIA metrics for each sample:

sample_name     DIA    bootstrap_lower bootstrap_upper
Sample1         15.23   14.50           16.10
Sample2         8.45    8.10            8.90

Example Workflow

# 1. Process the VCF file
# Extracts data for samples in 'variants.vcf.gz', filtering for 'PASS' variants,
# a minimum depth of 10 and exclude the haploid chromosome 'contig_C'.
python process_vcf.py test_data/variants.vcf.gz --prefix test_data/ --filter PASS --depth 10 --exclude_chroms contig_C

# 2. Calculate DIA for a specific sample
# Processes the output for 'Sample_3' sample generated in the previous step.
python calc_DIA.py test_data/Sample_3.txt test_data/Sample_3.html

# 3. Analyse the VCF file simply
# Perform a simple analysis for samples in 'variants.vcf.gz', filtering for 'PASS' variants,
# a minimum depth of 10 and exclude haploid chromosome 'contig_C'.
python simple_analyse.py test_data/variants.vcf.gz test_data/variants.tsv --exclude_chroms contig_C

Note on Test Data: For privacy and security reasons, the sample data provided in the test_data/ directory has been truncated and censored. It is intended for demonstration only and does not contain any reasonable biology infomation.

Feedback and Contribution

If you have any bug report, suggestion or question about the project, feel free to

Citation

when using DIAG please cite:

  • what can I say?

About

DIAG toolkit provides scripts for estimating Depth of Independent Amplicons (DIA) metrics from VCF files.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages