A comprehensive annotation pipeline tailored for eukaryotic genomes, particularly those from less well-studied organisms.
Currently, Eukan installation is only supported via Docker and Conda.
The Docker image installs all external tools via conda (from the same environment.yml used for local installs), then builds fitild from source and optionally includes GeneMark.
git clone https://github.com/BFL-lab/eukan.git
cd eukan
docker build -t eukan -f docker/Dockerfile .To include GeneMark (license required), place gmes_linux_64*.tar.gz and gm_key_64.gz in the project root before building. If omitted, the build will succeed but eukan check will report GeneMark as missing.
A separate development image adds test tooling (NCBI datasets CLI, procps):
docker build -t eukan-dev -f docker/Dockerfile.dev .Installs all external tools via bioconda and eukan itself via pip, in one step:
git clone https://github.com/BFL-lab/eukan.git
cd eukan
conda env create -f environment.yml
conda activate eukan
eukan checkThe eukan CLI configures all required environment variables (e.g. $ZOE, $ALN_TAB, EVM paths) automatically at startup. If you need to run the underlying tools directly outside of eukan, install the optional activation script:
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
cp conda-activate.sh $CONDA_PREFIX/etc/conda/activate.d/eukan.sh
conda deactivate && conda activate eukanThe environment.yml is auto-generated from eukan/data/tools.toml. To regenerate after modifying tool versions: python scripts/generate-env.py.
Two tools require manual installation after creating the environment. A helper script handles both:
# Download GeneMark (license required) from:
# https://topaz.gatech.edu/GeneMark/license_download.cgi
# Place gmes_linux_64*.tar.gz and gm_key_64.gz in the current directory, then:
./scripts/install-extras.sh
# Or point to a GeneMark archive elsewhere:
./scripts/install-extras.sh --genemark-tar /path/to/gmes_linux_64_4.tar.gzeukan check will tell you exactly what's missing and how to install it.
Requires Python >= 3.11 and Poetry. External tools must be installed separately (via conda or Docker):
poetry install --with dev # omit --with dev for a runtime-only install
poetry run eukan --helpThe dev group adds pytest, ruff, and mypy.
Python (managed by Poetry): click, gffutils, biopython, pandas, requests, pydantic-settings.
External tools (via Docker image or conda): AUGUSTUS, SNAP, CodingQuarry, spaln, GenomeThreader, EVidenceModeler, PASA, Trinity, STAR, samtools, BLAT, jellyfish, GMAP/GSNAP, fasta36, TRF.
Manual install:
- GeneMark-ES/ET/EP+: license required
- fitild: github.com/ogotoh/fitild (only needed for spaln-based protein alignment in default fitild mode)
- spaln analysis utilities (
utn,npssm,exinpot, etc.): built from spaln source viamake all(only needed for--spspspecies-specific parameter mode)
Use eukan-docker as a wrapper to run any subcommand inside the Docker container. It bind-mounts the current directory and runs as your user:
The typical workflow runs each subcommand from the same working directory. Each subcommand writes into its own subdirectory under that working directory (repeats/, assemble/, annotate/, func-annot/, submission/), and downstream subcommands auto-discover outputs from earlier ones via the cross-pipeline artifact registry — no need to pass paths between steps.
# 1. Download reference databases
./eukan-docker db-fetch
# 2. Soft-mask repeats (optional but recommended for AUGUSTUS)
./eukan-docker mask-repeats -g genome.fasta
# 3. Assemble transcriptome from RNA-seq reads (optional but recommended)
./eukan-docker assemble \
-g genome.fasta \
-l left_reads.fastq -r right_reads.fastq \
-S RF --kingdom protist
# 4. Annotate: auto-discovers assembly outputs, repeat hints, and databases
./eukan-docker annotate -g genome.fasta -p proteins.fasta --kingdom protist
# 5. Functional annotation: auto-discovers predicted proteins + databases
./eukan-docker func-annot
# 6. NCBI submission prep (table2asn validator + .sqn)
./eukan-docker prep-submission -t template.sbt --organism "Genus species"
# Extract sequences from GFF3
./eukan-docker gff3toseq -g genome.fasta -i genes.gff3 --output-format protein -o proteins.faaEach auto-discovered input can be overridden explicitly (see subcommand docs below).
Set EUKAN_IMAGE to use a custom image name (default: eukan):
EUKAN_IMAGE=myregistry/eukan:latest ./eukan-docker annotate ...poetry run eukan annotate -g genome.fasta -p proteins.fasta --kingdom fungus
poetry run eukan assemble -g genome.fasta -l left.fq -r right.fq -S RFRun the genome annotation pipeline. When run in the same directory as eukan assemble, transcript evidence (FASTA, GFF3, RNA-seq hints), strand-specificity, and a PASA database for UTR addition are discovered automatically.
Usage: eukan annotate [OPTIONS]
Required input:
-g, --genome PATH Genome FASTA (no lower-case; pipeline soft-masks repeats). [required]
-p, --proteins PATH One or more protein FASTA files. [required]
Pipeline parameters:
-k, --kingdom [fungus|protist|animal|plant]
Target organism kingdom.
-n, --numcpu INTEGER Number of CPU threads. [default: all]
--existing-augustus TEXT Use pre-trained AUGUSTUS species parameters.
-w, --weights INTEGER Weights: protein, gene predictions, transcripts. [default: 2 1 3]
-c, --code INTEGER NCBI genetic code table number. [default: 11]
Override options:
-tf, --transcripts-fasta PATH Override auto-discovered transcript FASTA.
-tg, --transcripts-gff PATH Override auto-discovered transcript GFF3.
-r, --rnaseq-hints PATH Override auto-discovered RNA-seq hints GFF.
--strand-specific Transcripts are strand-oriented.
--utrs PATH PASA SQLite database for UTR addition.
--splice-permissive Allow non-canonical splice sites (GC-AG, AT-AC).
Experimental:
--spsp Build species-specific spaln parameters from transcripts
(alternative to fitild). See "Protein alignment modes" below.
Re-run steps:
--run-genemark Force re-run GeneMark gene prediction.
--run-prot-align Force re-run protein alignment (spaln/gth).
--run-augustus Force re-run AUGUSTUS training and prediction.
--run-snap Force re-run SNAP (and CodingQuarry) prediction.
--run-consensus Force re-run EVM consensus model building.
It's tempting to provide more protein sequence data to the pipeline with the idea of covering more of the genome, but typically this leads to a degradation in performance at worst, or strongly diminishing returns at best. This is mostly due to the presence of the same orthologs, paralogs and derived sequences across multiple different species within the database. The further those sequences are to the target genome, lower the useful signal in aligning them. If can often lead to artifacts or misleading evidence that can interfere with the consensus building process, unless down-weighted. It's more ideal to find ten proteomes of the closest neighboring species available, otherwise the UniProt-SwissProt verified collection is a good alternative.
Assemble transcriptome from RNA-seq reads for use with eukan annotate. Provide either paired-end reads (--left and --right) or single-end reads (--single).
Usage: eukan assemble [OPTIONS]
Required input:
-g, --genome PATH Genome FASTA. [required]
-l, --left PATH Left paired-end reads.
-r, --right PATH Right paired-end reads.
-s, --single PATH Single-end reads.
Pipeline parameters:
-n, --numcpu INTEGER Number of CPUs. [default: all]
-S, --strand-specific [RF|FR|R|F]
Strand-specific library type (RF/FR for paired, R/F for single).
-t, --align-mode [EndToEnd|Local] STAR alignment mode. [default: Local]
--splice-permissive Allow non-canonical splice sites (GC-AG, AT-AC).
-c, --code [1|6|10|12] NCBI genetic code for PASA. [default: 1]
-m, --min-intron INTEGER Min intron length. [default: 20]
-M, --max-intron INTEGER Max intron length. [default: 5000]
--phred [33|64] Phred quality score. [default: 33]
-j, --jaccard-clip Enable jaccard clipping.
Re-run steps:
-A, --run-star Force re-run STAR read mapping.
-T, --run-trinity Force re-run Trinity assembly.
-P, --run-pasa Force re-run PASA alignment.
-f, --force Force re-run all steps.
The pipeline runs STAR mapping, genome-guided + de novo Trinity assembly, and PASA alignment. STAR also profiles splice site types from junction evidence (splice_site_summary.json), which the annotation pipeline uses to allow non-canonical splice sites in AUGUSTUS. If no step flags (-A, -T, -P) are given, all steps run.
Immediately after STAR mapping, eukan assemble walks the sorted BAM once to produce softclip_diagnostic_summary.json. It surfaces two signals downstream gene prediction needs to know about:
- Trans-splicing — soft-clip ends are clustered by a K-bp substring anchored at the alignment boundary (K=12 by default), with Hamming-tolerant per-locus consolidation and cross-locus canonicalization. A few clusters supported by many genomic loci is the trans-splicing fingerprint (the splice-leader sequence sits on the 5' end of every trans-spliced transcript). The top non-low-complexity cluster's K-bp anchor key is reported alongside a per-column majority-vote consensus motif that extends past the anchor — for kinetoplastid datasets this typically recovers ~30–40 bp of the conserved SL. The verdict is categorical (
STRONG/MODERATE/ABSENT) and a WARNING is logged when present so the user knows reads may need splice-leader trimming before annotation. - Non-canonical splice usage — the per-N-cigar-op donor/acceptor dinucleotide histogram is rolled up into a
canonical_pctplus a top non-canonical pair. Verdict (EXTENSIVE/MODERATE/ABSENT) tracks the prevalence;EXTENSIVEis the signal that the genome warrants--splice-permissiveoneukan annotate.
The diagnostic is idempotent (skipped if its summary already exists in the work dir) and adds ~30s per 100M reads on top of STAR's runtime. Tune the consensus stringency with consensus_min_majority_fraction on diagnose_bam() when invoking the library directly; defaults are tuned for whole-genome RNA-seq.
Soft-mask repeats with RepeatModeler + RepeatMasker. The de novo families library is inferred by RepeatModeler, then RepeatMasker uses it to lower-case repetitive bases in the genome.
Usage: eukan mask-repeats [OPTIONS]
Required input:
-g, --genome PATH Genome FASTA. [required]
Pipeline parameters:
-n, --numcpu INTEGER Number of CPU threads. [default: all]
--engine [rmblast|ncbi] Search engine. [default: rmblast]
--lib PATH Pre-built repeat-family library FASTA. When set,
RepeatModeler is skipped.
Re-run steps:
--run-modeler Force re-run BuildDatabase + RepeatModeler.
--run-masker Force re-run RepeatMasker.
-f, --force Force re-run all steps.
Outputs (in the working directory):
<stem>.masked.fasta— soft-masked genome (lower-case in repeats).<stem>.repeats.gff— raw RepeatMasker GFF.hints_repeatmask.gff— AUGUSTUS-formatnonexonpart/src=RMhints.
Pass the masked genome to the next stage (eukan annotate -g <stem>.masked.fasta); AUGUSTUS auto-discovers hints_repeatmask.gff from the working directory and weights it via the RM extrinsic source already declared in the shipped augustus.config.
Add functional annotations to predicted proteins. When run after eukan annotate and eukan db-fetch, the predicted protein sequences and reference databases are discovered automatically.
Two homology sources are supported (selected with --homology-db); Pfam is searched in both modes:
--homology-db uniprot(default) — phmmer against UniProt-SwissProt. Broad coverage of curated proteins; emitsproduct=<description>andinference=similar to AA sequence:UniProtKB:<accession>.--homology-db kofam— KEGG Orthology assignment. hmmscan against the eukaryote subset of the KOfam HMM database, with per-KO bit-score thresholds (fullordomainscore depending on the KO) loaded fromko_list. Only KOs whose hit score meets the curated cutoff drive product/EC assignment; below-threshold hits are recorded but not promoted. Emitsproduct=<KO definition>,ec_number=<EC>(parsed out of[EC:…]tags in the definition),Dbxref=KEGG:K<number>, andinference=protein motif:KOFAM:K<number>.
KOfam mode is adapted from KofamKOALA: the database, per-KO thresholds,
eukaryote.halfilter, and full-vs-domain adaptive scoring all come from the KofamKOALA paper. If you use--homology-db kofam, please cite: Aramaki T, Blanc-Mathieu R, Endo H, Ohkubo K, Kanehisa M, Goto S, Ogata H. KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 2020 Apr 1;36(7):2251–2252. doi:10.1093/bioinformatics/btz859.
Usage: eukan func-annot [OPTIONS]
Pipeline parameters:
-n, --numcpu INTEGER Number of CPUs. [default: all]
--homology-db [uniprot|kofam] Homology DB to run alongside Pfam.
[default: uniprot]
-e, --evalue TEXT E-value cutoff. [default: 1e-1]
Override options:
-p, --proteins PATH Amino acid sequences FASTA.
--uniprot PATH UniProt-SwissProt database FASTA.
--kofam PATH KOfam pressed HMM database.
--ko-list PATH KOfam ko_list TSV (per-KO thresholds + definitions).
--pfam PATH Pfam HMM database.
--gff3 PATH GFF3 file to annotate with functional info.
-f, --force Re-run steps even if outputs exist.
Runs the chosen homology search plus hmmscan against Pfam (both via pyhmmer). Produces:
input.mod.faa: annotated FASTA with functional descriptions in headers.input.mod.gff3: (if--gff3provided) GFF3 withproduct/inference(and, for KOfam,Dbxref/ec_number) attributes.
For UniProt mode, hits with e-values between 1e-3 and the cutoff are reported as marginal. For KOfam mode, hits below the per-KO bit-score threshold are tagged [marginal KO hit] in the FASTA description and do not contribute to the GFF3 product/Dbxref/inference lines — the KofamKOALA paper's evaluation showed that the adaptive thresholds yield substantially better precision/recall than a single global E-value cutoff.
Validate and package an annotated genome for NCBI submission. Wraps NCBI's table2asn with the standard recipe (-split-logs -W -J -Z -euk -T -V b plus -c/-M/-a), producing a .sqn upload-ready file alongside .val, .dr, and .stats validator reports for iterative GFF3 refinement. When run after eukan annotate and eukan func-annot, the genome (from eukan-run.json) and annotated GFF3 (final.mod.gff3, falling back to final.gff3) are auto-discovered.
Usage: eukan prep-submission [OPTIONS]
Required input:
-t, --template PATH NCBI submission template (.sbt). [required]
Source qualifiers:
--organism TEXT Organism scientific name (e.g. 'Homo sapiens').
Required unless --source-info is given.
--isolate TEXT Isolate / strain identifier.
-j, --source-info TEXT Raw -j string for table2asn (e.g.
'[organism=Foo] [isolate=Bar] [country=Canada]').
Overrides --organism / --isolate when set.
--locus-tag-prefix TEXT NCBI-registered locus tag prefix (required for
new-genome submissions).
Override options:
-g, --genome PATH Override auto-discovered genome FASTA.
-i, --gff3 PATH Override auto-discovered annotated GFF3.
Pipeline parameters:
--cleanup TEXT table2asn -c cleanup flags. [default: befw]
--mode TEXT table2asn -M flatfile mode. [default: n]
-a, --assembly-type TEXT table2asn -a assembly type / gap config.
[default: r10k]
--extra-args TEXT Extra table2asn arguments, shell-quoted
(e.g. --extra-args '-split-dr -huge').
--cleanup-gff3 / --no-cleanup-gff3
Pre-process the GFF3 (strip UniProt cruft, drop
CDS-less mRNAs, cap inferences) before handing
it to table2asn. [default: cleanup-gff3]
Output options:
-o, --output-file PATH Output .sqn path.
[default: <output-dir>/<genome-stem>.sqn]
-d, --output-dir PATH Output directory. [default: ./submission]
--print-command Print the resolved table2asn command and exit.
--dry-run Print the command and create the output dir,
but don't run table2asn.
The .sbt submission template must be created via NCBI's web form — it is the only mandatory file input. Outputs land in ./submission/:
<genome-stem>.sqn— GenBank-format submission file ready for upload.<genome-stem>.val— validator report (FATAL/ERROR/WARNING/INFO counts are logged on completion).<genome-stem>.dr— discrepancy report.<genome-stem>.stats— feature statistics.<gff3-stem>.cleaned.gff3— the post-cleanup GFF3 actually fed to table2asn (when--cleanup-gff3is enabled).
By default the GFF3 is pre-processed before table2asn sees it: UniProt-style metadata (OS=...OX=...GN=...PE=...SV=...) is stripped from product= values (without this, table2asn rewrites every product to "hypothetical protein", losing all functional annotation), (Fragment) suffixes are removed, mRNAs with no CDS children are dropped (along with newly-orphaned genes), and inference= lists are capped at three accessions per feature. Pass --no-cleanup-gff3 to bypass.
The intended workflow is iterative:
# First pass — typical errors surface as ERROR/WARNING in the .val report.
eukan prep-submission -t template.sbt --organism "Genus species"
# Inspect submission/<genome-stem>.val and .dr, refine the GFF3, re-run.
# Use --print-command to inspect the exact table2asn invocation without running it:
eukan prep-submission -t template.sbt --organism "Genus species" --print-commandThe command exits non-zero if table2asn reports any FATAL validation errors; the .val report path is included in the error message.
Extract protein or cDNA sequences from a GFF3 + genome.
Usage: eukan gff3toseq [OPTIONS]
Required input:
-g, --genome PATH Genome FASTA. [required]
-i, --gff3 PATH GFF3 with gene models. [required]
Pipeline parameters:
--output-format [protein|cdna] Output sequence type. [default: protein]
-c, --code INTEGER NCBI genetic code table number. [default: 1]
Output options:
-o, --output-file FILENAME Write FASTA to this file ('-' for stdout).
[default: stdout]
Download reference databases. Pfam is always fetched; the homology DB to pair it with is chosen by --homology-db.
Usage: eukan db-fetch [OPTIONS]
Options:
-o, --output-dir PATH Directory to download into. [default: databases]
--homology-db [uniprot|kofam] Homology DB to fetch alongside Pfam.
[default: uniprot]
-f, --force Re-download even if databases are up to date.
-d, --database [uniprot|pfam|kofam|ko_list]
Specific database(s) to fetch. Overrides
--homology-db when given.
Downloads and prepares:
Pfam-A.hmm— Pfam HMM profiles (decompressed and pressed for hmmscan). Always fetched.uniprot_sprot.faa— UniProt-SwissProt protein sequences (--homology-db uniprot, default).kofam_eukaryote.hmm+ko_list.tsv— KOfam profiles and per-KO bit-score thresholds (--homology-db kofam). The fetcher downloads KofamKOALA'sprofiles.tar.gzandko_list.gzfrom https://www.genome.jp/ftp/db/kofam/, filters profiles via the shippedeukaryote.hallist (~16k of ~27k KOs), concatenates them into a single HMM file, and presses it.ko_listis the per-KO metadata table used byfunc-annot --homology-db kofam; see theeukan func-annotsection for the KofamKOALA citation.
Each database has an entry in databases/.manifest.json (md5 + source URL + download date); re-running eukan db-fetch is a no-op when files match the manifest. Use -d <name> for a targeted refresh of a single file (e.g. -d pfam to refresh only Pfam without touching the homology DB).
Compare predicted gene models against a reference or previous annotation to assess annotation quality. Reports gene-level classification (exact, inexact, missing, merged, fragmented, novel), subfeature-level metrics (mRNA, CDS, intron), and overlap-based sensitivity/specificity/F1 scores. Repeat --predicted to compare several predictions against the same reference and tabulate, per gene-level classification, which subsets of predictions agreed.
Usage: eukan compare [OPTIONS]
Required input:
-r, --reference PATH Reference GFF3 file.
-p, --predicted PATH Predicted GFF3 file. Repeat to compare multiple
predictions against the same reference.
Pipeline parameters:
-L, --label TEXT Short label per --predicted (must appear once
per --predicted, or be omitted to use file
stems).
Output options:
-o, --output-file PATH Write per-feature details to a TSV file. In
multi-prediction mode a leading 'prediction'
column is prepended.
The classification system and metrics are further described in the paper referenced in Citation. Gene-level classifications:
- exact: prediction coordinates match reference exactly.
- inexact: prediction overlaps a single reference with boundary differences.
- missing: reference gene with no overlapping prediction (false negative).
- merged: prediction spans 2+ reference genes.
- fragmented: 2+ predictions each cover a single reference gene.
- novel: prediction with no reference overlap (possibly false positive).
For matched features, reports overlap-based sensitivity (overlap / reference length), specificity (overlap / prediction length), and F1 score. Boundary differences (5' and 3') are reported for inexact matches.
# Single prediction
eukan compare -r reference.gff3 -p predicted.gff3
# Single prediction with per-feature TSV
eukan compare -r reference.gff3 -p predicted.gff3 -o details.tsv
# Multiple predictions against the same reference
eukan compare -r reference.gff3 \
-p eukan.gff3 -p braker.gff3 -p maker.gff3 \
-L Eukan -L Braker -L Maker \
-o details.tsvWhen two or more --predicted inputs are supplied, the report adds a comparative summary on top of the per-prediction breakdowns:
- F1 by level / prediction — count-based F1 at gene, mRNA, CDS, and intron level for each prediction, side by side.
- Per-class powerset — for each gene-level classification (
match= exact|inexact,missing,merged,fragmented), and for each reference gene, records the sorted tuple of prediction labels whose classification of that gene was that class, then tallies these tuples. Counts within a class sum to the number of reference genes; the(none)row counts genes no prediction assigned to that class. Full enumeration up to N=6 predictions; condensed (shared-by-all / shared-by-none / uniquely-per-pred) above.
--label defaults to each prediction's filename stem; if two predictions share a stem they are auto-numbered with a warning. Use --label to set explicit names; counts must match --predicted.
The N=1 path is unchanged from earlier releases — the comparative section is omitted entirely and the per-feature TSV keeps the legacy schema.
Verify Python dependencies, external tools, and databases.
Usage: eukan check [OPTIONS]
Options:
--for [annotate|assemble|mask-repeats|func-annot|db-fetch|prep-submission]
Only check tools needed by these subcommands. If omitted, check all.
--db-dir PATH Database directory to check. [default: databases]
Checks Python dependencies, probes each external tool with a version/help command, and verifies database integrity. Exits 0 if all checks pass, 1 if any fail.
# Check everything
eukan check
# Check only what's needed for annotation
eukan check --for annotate
# Check multiple subcommands
eukan check --for annotate --for assembleExample output:
Checked 14 external tools:
12 tools OK:
✓ samtools samtools 1.20
✓ AUGUSTUS AUGUSTUS (3.5.0)
...
2 tools MISSING or BROKEN:
✗ codingquarry `CodingQuarry` not found on PATH; env not set: $QUARRY_PATH
used by: annotate
✗ fitild `fitild` not found on PATH
used by: annotate
hint: Build from source: git clone https://github.com/ogotoh/fitild ...
The annotation pipeline (eukan annotate) runs the following steps:
- ORF finding: Identify ORFs in transcript assemblies (if provided). Uses the configured genetic code (
-c/--code) so that alternative stop codons (e.g., code 6 where TAA/TAG encode glutamine) are handled correctly. - GeneMark: Ab initio gene prediction (ES mode, or ET mode if RNA-seq intron hints are available with >= 150 introns). Passes
--gcodefor non-standard genetic codes (codes 6 and 26). - Protein alignment: Spliced alignment via spaln (intron-rich genomes, > 25% introns/gene) or GenomeThreader (intron-poor). See Protein alignment modes.
- AUGUSTUS: Train species-specific parameters from concordant GeneMark/protein models, then predict genes using protein + RNA-seq hints. Non-canonical splice sites (e.g., AT-AC) are allowed automatically when supported by sufficient junction evidence from STAR;
--splice-permissivelowers the evidence thresholds. - SNAP: Train and predict (all kingdoms). CodingQuarry also runs for fungus/protist genomes.
- EVidenceModeler: Build weighted consensus gene models from all evidence sources.
- PASA UTRs: Add UTR annotations and model alternative splicing from the transcriptome database.
- Final output: Assign sequential locus tags and correct CDS phases. Non-overlapping transcript ORFs not captured by EVM are patched into the final model set.
Output: final.gff3 in the working directory.
spaln protein alignment supports two modes for modeling intron structure:
Default (fitild): Builds an intron length distribution from GeneMark predictions and feeds it to spaln via the -yI parameter. This is the established approach and requires the fitild tool.
Species-specific parameters (--spsp): Uses transcript data to build full species-specific spaln parameters (splice site models, intron potential, codon usage) via spaln's make_eij.pl and make_ssp.pl scripts. This produces richer alignment parameters than fitild alone, at the cost of additional computation. Requires transcript data (from eukan assemble or provided via --transcripts-fasta).
# Default mode (fitild)
eukan annotate -g genome.fasta -p proteins.fasta
# Species-specific parameter mode (experimental)
eukan annotate -g genome.fasta -p proteins.fasta --spspWhen --spsp is used, protein alignment results are written to a separate directory (prot_align_ssp/) so both modes can coexist for comparison.
All pipelines write to a shared eukan-run.json manifest in the working directory, tracking per-step status, timing, and output checksums. This enables:
- Resume: Re-running a subcommand skips steps that already completed.
- Selective re-run: Use
--run-*flags to force specific steps to re-execute. - Integrity checking: On startup, completed steps are validated (output exists and is non-empty). If a discrepancy is found, the pipeline reports the issue and suggests the
--run-*flag to fix it. - Progress monitoring:
eukan statusreads the manifest and shows progress across all pipelines.
# View run status
eukan status
# Re-run only protein alignment
eukan annotate -g genome.fasta -p proteins.fasta --run-prot-align
# Re-run only PASA in assembly
eukan assemble -g genome.fasta -l left.fq -r right.fq -Ppoetry run pytest tests/ -vUnit tests cover GFF3 parsing/serialization, genomic interval operations, ORF finding, configuration validation, run manifest tracking, database integrity checks, and CLI wiring. They run without external tools or network access.
A development CLI at tests/run_pipeline.py drives a full end-to-end pipeline run using S. pombe chromosome III as the test organism.
- All external tools installed (Docker or conda environment; verify with
eukan check) - NCBI datasets CLI and SRA Toolkit on PATH (for downloading test data)
When using Docker, build and use the dev image (eukan-dev) which includes the NCBI datasets CLI:
docker build -t eukan-dev -f docker/Dockerfile.dev .python tests/run_pipeline.py setup-test-data [-o tests/data]Downloads from NCBI:
- Genome: S. pombe chromosome III (
NC_003424.3, ~2.5 Mbp) - Proteins: 10 close neighbor proteomes via
datasets - RNA-seq: 5 SRA paired-end runs via
prefetch+fasterq-dump
Accession lists live in tests/data/*.txt and are never deleted by cleanup. Downloads are idempotent.
# Full run: assembly + annotation (default kingdom: fungus)
python tests/run_pipeline.py test-pipeline --kingdom fungus -n 8
# Protein-only: skip transcriptome assembly
python tests/run_pipeline.py test-pipeline --protein-only -n 8
# Custom directories
python tests/run_pipeline.py test-pipeline -d tests/data -w tests/pipeline-runThe test pipeline runs:
- Transcriptome assembly: STAR read mapping, Trinity (genome-guided + de novo), PASA alignment
- Genome annotation: GeneMark, protein alignment (spaln/gth), AUGUSTUS, SNAP, EVM consensus
If assembly fails, it falls back to protein-only annotation automatically.
Output lands in tests/pipeline-run/ with one subdirectory per pipeline step (assemble/, annotate/, func-annot/, submission/). View run details with eukan status -d tests/pipeline-run/annotate. The end-of-run summary reports the gene/mRNA counts from final.gff3 plus the percentage of mRNAs that received a UniProt or Pfam inference.
# Remove pipeline outputs only
python tests/run_pipeline.py clean-test-data
# Remove outputs + downloaded data (genome, proteins, reads)
python tests/run_pipeline.py clean-test-data --allAll three subcommands accept -h for detailed help.
If you use eukan, please cite:
Sarrasin M, Burger G, Lang BF. Eukan: a fully automated nuclear genome annotation pipeline for less studied and divergent eukaryotes. NAR Genomics and Bioinformatics. 2026 Mar;8(1):lqag003. doi:10.1093/nargab/lqag003
See LICENSE.md.