Skip to content

BFL-lab/eukan

Repository files navigation

eukan: Eukaryotic Genome Annotation Pipeline

A comprehensive annotation pipeline tailored for eukaryotic genomes, particularly those from less well-studied organisms.

Installation

Currently, Eukan installation is only supported via Docker and Conda.

Docker

The Docker image installs all external tools via conda (from the same environment.yml used for local installs), then builds fitild from source and optionally includes GeneMark.

git clone https://github.com/BFL-lab/eukan.git
cd eukan
docker build -t eukan -f docker/Dockerfile .

To include GeneMark (license required), place gmes_linux_64*.tar.gz and gm_key_64.gz in the project root before building. If omitted, the build will succeed but eukan check will report GeneMark as missing.

A separate development image adds test tooling (NCBI datasets CLI, procps):

docker build -t eukan-dev -f docker/Dockerfile.dev .

Conda

Installs all external tools via bioconda and eukan itself via pip, in one step:

git clone https://github.com/BFL-lab/eukan.git
cd eukan
conda env create -f environment.yml
conda activate eukan
eukan check

The eukan CLI configures all required environment variables (e.g. $ZOE, $ALN_TAB, EVM paths) automatically at startup. If you need to run the underlying tools directly outside of eukan, install the optional activation script:

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
cp conda-activate.sh $CONDA_PREFIX/etc/conda/activate.d/eukan.sh
conda deactivate && conda activate eukan

The environment.yml is auto-generated from eukan/data/tools.toml. To regenerate after modifying tool versions: python scripts/generate-env.py.

Two tools require manual installation after creating the environment. A helper script handles both:

# Download GeneMark (license required) from:
#   https://topaz.gatech.edu/GeneMark/license_download.cgi
# Place gmes_linux_64*.tar.gz and gm_key_64.gz in the current directory, then:
./scripts/install-extras.sh

# Or point to a GeneMark archive elsewhere:
./scripts/install-extras.sh --genemark-tar /path/to/gmes_linux_64_4.tar.gz

eukan check will tell you exactly what's missing and how to install it.

Local development

Requires Python >= 3.11 and Poetry. External tools must be installed separately (via conda or Docker):

poetry install --with dev   # omit --with dev for a runtime-only install
poetry run eukan --help

The dev group adds pytest, ruff, and mypy.

Dependencies

Python (managed by Poetry): click, gffutils, biopython, pandas, requests, pydantic-settings.

External tools (via Docker image or conda): AUGUSTUS, SNAP, CodingQuarry, spaln, GenomeThreader, EVidenceModeler, PASA, Trinity, STAR, samtools, BLAT, jellyfish, GMAP/GSNAP, fasta36, TRF.

Manual install:

  • GeneMark-ES/ET/EP+: license required
  • fitild: github.com/ogotoh/fitild (only needed for spaln-based protein alignment in default fitild mode)
  • spaln analysis utilities (utn, npssm, exinpot, etc.): built from spaln source via make all (only needed for --spsp species-specific parameter mode)

Usage

Docker

Use eukan-docker as a wrapper to run any subcommand inside the Docker container. It bind-mounts the current directory and runs as your user:

The typical workflow runs each subcommand from the same working directory. Each subcommand writes into its own subdirectory under that working directory (repeats/, assemble/, annotate/, func-annot/, submission/), and downstream subcommands auto-discover outputs from earlier ones via the cross-pipeline artifact registry — no need to pass paths between steps.

# 1. Download reference databases
./eukan-docker db-fetch

# 2. Soft-mask repeats (optional but recommended for AUGUSTUS)
./eukan-docker mask-repeats -g genome.fasta

# 3. Assemble transcriptome from RNA-seq reads (optional but recommended)
./eukan-docker assemble \
    -g genome.fasta \
    -l left_reads.fastq -r right_reads.fastq \
    -S RF --kingdom protist

# 4. Annotate:  auto-discovers assembly outputs, repeat hints, and databases
./eukan-docker annotate -g genome.fasta -p proteins.fasta --kingdom protist

# 5. Functional annotation:  auto-discovers predicted proteins + databases
./eukan-docker func-annot

# 6. NCBI submission prep (table2asn validator + .sqn)
./eukan-docker prep-submission -t template.sbt --organism "Genus species"

# Extract sequences from GFF3
./eukan-docker gff3toseq -g genome.fasta -i genes.gff3 --output-format protein -o proteins.faa

Each auto-discovered input can be overridden explicitly (see subcommand docs below).

Set EUKAN_IMAGE to use a custom image name (default: eukan):

EUKAN_IMAGE=myregistry/eukan:latest ./eukan-docker annotate ...

Local (development)

poetry run eukan annotate -g genome.fasta -p proteins.fasta --kingdom fungus
poetry run eukan assemble -g genome.fasta -l left.fq -r right.fq -S RF

Subcommands

eukan annotate

Run the genome annotation pipeline. When run in the same directory as eukan assemble, transcript evidence (FASTA, GFF3, RNA-seq hints), strand-specificity, and a PASA database for UTR addition are discovered automatically.

Usage: eukan annotate [OPTIONS]

Required input:
  -g, --genome PATH               Genome FASTA (no lower-case; pipeline soft-masks repeats). [required]
  -p, --proteins PATH             One or more protein FASTA files. [required]

Pipeline parameters:
  -k, --kingdom [fungus|protist|animal|plant]
                                   Target organism kingdom.
  -n, --numcpu INTEGER             Number of CPU threads. [default: all]
  --existing-augustus TEXT          Use pre-trained AUGUSTUS species parameters.
  -w, --weights INTEGER            Weights: protein, gene predictions, transcripts. [default: 2 1 3]
  -c, --code INTEGER               NCBI genetic code table number. [default: 11]

Override options:
  -tf, --transcripts-fasta PATH   Override auto-discovered transcript FASTA.
  -tg, --transcripts-gff PATH     Override auto-discovered transcript GFF3.
  -r, --rnaseq-hints PATH         Override auto-discovered RNA-seq hints GFF.
  --strand-specific                Transcripts are strand-oriented.
  --utrs PATH                      PASA SQLite database for UTR addition.
  --splice-permissive              Allow non-canonical splice sites (GC-AG, AT-AC).

Experimental:
  --spsp                           Build species-specific spaln parameters from transcripts
                                   (alternative to fitild). See "Protein alignment modes" below.

Re-run steps:
  --run-genemark                   Force re-run GeneMark gene prediction.
  --run-prot-align                 Force re-run protein alignment (spaln/gth).
  --run-augustus                    Force re-run AUGUSTUS training and prediction.
  --run-snap                       Force re-run SNAP (and CodingQuarry) prediction.
  --run-consensus                  Force re-run EVM consensus model building.

Notes on annotate input

It's tempting to provide more protein sequence data to the pipeline with the idea of covering more of the genome, but typically this leads to a degradation in performance at worst, or strongly diminishing returns at best. This is mostly due to the presence of the same orthologs, paralogs and derived sequences across multiple different species within the database. The further those sequences are to the target genome, lower the useful signal in aligning them. If can often lead to artifacts or misleading evidence that can interfere with the consensus building process, unless down-weighted. It's more ideal to find ten proteomes of the closest neighboring species available, otherwise the UniProt-SwissProt verified collection is a good alternative.

eukan assemble

Assemble transcriptome from RNA-seq reads for use with eukan annotate. Provide either paired-end reads (--left and --right) or single-end reads (--single).

Usage: eukan assemble [OPTIONS]

Required input:
  -g, --genome PATH               Genome FASTA. [required]
  -l, --left PATH                 Left paired-end reads.
  -r, --right PATH                Right paired-end reads.
  -s, --single PATH               Single-end reads.

Pipeline parameters:
  -n, --numcpu INTEGER             Number of CPUs. [default: all]
  -S, --strand-specific [RF|FR|R|F]
                                   Strand-specific library type (RF/FR for paired, R/F for single).
  -t, --align-mode [EndToEnd|Local] STAR alignment mode. [default: Local]
  --splice-permissive              Allow non-canonical splice sites (GC-AG, AT-AC).
  -c, --code [1|6|10|12]           NCBI genetic code for PASA. [default: 1]
  -m, --min-intron INTEGER         Min intron length. [default: 20]
  -M, --max-intron INTEGER         Max intron length. [default: 5000]
  --phred [33|64]                  Phred quality score. [default: 33]
  -j, --jaccard-clip               Enable jaccard clipping.

Re-run steps:
  -A, --run-star                   Force re-run STAR read mapping.
  -T, --run-trinity                Force re-run Trinity assembly.
  -P, --run-pasa                   Force re-run PASA alignment.
  -f, --force                      Force re-run all steps.

The pipeline runs STAR mapping, genome-guided + de novo Trinity assembly, and PASA alignment. STAR also profiles splice site types from junction evidence (splice_site_summary.json), which the annotation pipeline uses to allow non-canonical splice sites in AUGUSTUS. If no step flags (-A, -T, -P) are given, all steps run.

Soft-clip + intron BAM diagnostic

Immediately after STAR mapping, eukan assemble walks the sorted BAM once to produce softclip_diagnostic_summary.json. It surfaces two signals downstream gene prediction needs to know about:

  • Trans-splicing — soft-clip ends are clustered by a K-bp substring anchored at the alignment boundary (K=12 by default), with Hamming-tolerant per-locus consolidation and cross-locus canonicalization. A few clusters supported by many genomic loci is the trans-splicing fingerprint (the splice-leader sequence sits on the 5' end of every trans-spliced transcript). The top non-low-complexity cluster's K-bp anchor key is reported alongside a per-column majority-vote consensus motif that extends past the anchor — for kinetoplastid datasets this typically recovers ~30–40 bp of the conserved SL. The verdict is categorical (STRONG / MODERATE / ABSENT) and a WARNING is logged when present so the user knows reads may need splice-leader trimming before annotation.
  • Non-canonical splice usage — the per-N-cigar-op donor/acceptor dinucleotide histogram is rolled up into a canonical_pct plus a top non-canonical pair. Verdict (EXTENSIVE / MODERATE / ABSENT) tracks the prevalence; EXTENSIVE is the signal that the genome warrants --splice-permissive on eukan annotate.

The diagnostic is idempotent (skipped if its summary already exists in the work dir) and adds ~30s per 100M reads on top of STAR's runtime. Tune the consensus stringency with consensus_min_majority_fraction on diagnose_bam() when invoking the library directly; defaults are tuned for whole-genome RNA-seq.

eukan mask-repeats

Soft-mask repeats with RepeatModeler + RepeatMasker. The de novo families library is inferred by RepeatModeler, then RepeatMasker uses it to lower-case repetitive bases in the genome.

Usage: eukan mask-repeats [OPTIONS]

Required input:
  -g, --genome PATH               Genome FASTA. [required]

Pipeline parameters:
  -n, --numcpu INTEGER            Number of CPU threads. [default: all]
  --engine [rmblast|ncbi]         Search engine. [default: rmblast]
  --lib PATH                      Pre-built repeat-family library FASTA. When set,
                                  RepeatModeler is skipped.

Re-run steps:
  --run-modeler                   Force re-run BuildDatabase + RepeatModeler.
  --run-masker                    Force re-run RepeatMasker.
  -f, --force                     Force re-run all steps.

Outputs (in the working directory):

  • <stem>.masked.fasta — soft-masked genome (lower-case in repeats).
  • <stem>.repeats.gff — raw RepeatMasker GFF.
  • hints_repeatmask.gff — AUGUSTUS-format nonexonpart/src=RM hints.

Pass the masked genome to the next stage (eukan annotate -g <stem>.masked.fasta); AUGUSTUS auto-discovers hints_repeatmask.gff from the working directory and weights it via the RM extrinsic source already declared in the shipped augustus.config.

eukan func-annot

Add functional annotations to predicted proteins. When run after eukan annotate and eukan db-fetch, the predicted protein sequences and reference databases are discovered automatically.

Two homology sources are supported (selected with --homology-db); Pfam is searched in both modes:

  • --homology-db uniprot (default) — phmmer against UniProt-SwissProt. Broad coverage of curated proteins; emits product=<description> and inference=similar to AA sequence:UniProtKB:<accession>.
  • --homology-db kofam — KEGG Orthology assignment. hmmscan against the eukaryote subset of the KOfam HMM database, with per-KO bit-score thresholds (full or domain score depending on the KO) loaded from ko_list. Only KOs whose hit score meets the curated cutoff drive product/EC assignment; below-threshold hits are recorded but not promoted. Emits product=<KO definition>, ec_number=<EC> (parsed out of [EC:…] tags in the definition), Dbxref=KEGG:K<number>, and inference=protein motif:KOFAM:K<number>.

KOfam mode is adapted from KofamKOALA: the database, per-KO thresholds, eukaryote.hal filter, and full-vs-domain adaptive scoring all come from the KofamKOALA paper. If you use --homology-db kofam, please cite: Aramaki T, Blanc-Mathieu R, Endo H, Ohkubo K, Kanehisa M, Goto S, Ogata H. KofamKOALA: KEGG ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 2020 Apr 1;36(7):2251–2252. doi:10.1093/bioinformatics/btz859.

Usage: eukan func-annot [OPTIONS]

Pipeline parameters:
  -n, --numcpu INTEGER           Number of CPUs. [default: all]
  --homology-db [uniprot|kofam]  Homology DB to run alongside Pfam.
                                 [default: uniprot]
  -e, --evalue TEXT              E-value cutoff. [default: 1e-1]

Override options:
  -p, --proteins PATH    Amino acid sequences FASTA.
  --uniprot PATH         UniProt-SwissProt database FASTA.
  --kofam PATH           KOfam pressed HMM database.
  --ko-list PATH         KOfam ko_list TSV (per-KO thresholds + definitions).
  --pfam PATH            Pfam HMM database.
  --gff3 PATH            GFF3 file to annotate with functional info.
  -f, --force            Re-run steps even if outputs exist.

Runs the chosen homology search plus hmmscan against Pfam (both via pyhmmer). Produces:

  • input.mod.faa: annotated FASTA with functional descriptions in headers.
  • input.mod.gff3: (if --gff3 provided) GFF3 with product/inference (and, for KOfam, Dbxref/ec_number) attributes.

For UniProt mode, hits with e-values between 1e-3 and the cutoff are reported as marginal. For KOfam mode, hits below the per-KO bit-score threshold are tagged [marginal KO hit] in the FASTA description and do not contribute to the GFF3 product/Dbxref/inference lines — the KofamKOALA paper's evaluation showed that the adaptive thresholds yield substantially better precision/recall than a single global E-value cutoff.

eukan prep-submission

Validate and package an annotated genome for NCBI submission. Wraps NCBI's table2asn with the standard recipe (-split-logs -W -J -Z -euk -T -V b plus -c/-M/-a), producing a .sqn upload-ready file alongside .val, .dr, and .stats validator reports for iterative GFF3 refinement. When run after eukan annotate and eukan func-annot, the genome (from eukan-run.json) and annotated GFF3 (final.mod.gff3, falling back to final.gff3) are auto-discovered.

Usage: eukan prep-submission [OPTIONS]

Required input:
  -t, --template PATH         NCBI submission template (.sbt). [required]

Source qualifiers:
  --organism TEXT             Organism scientific name (e.g. 'Homo sapiens').
                              Required unless --source-info is given.
  --isolate TEXT              Isolate / strain identifier.
  -j, --source-info TEXT      Raw -j string for table2asn (e.g.
                              '[organism=Foo] [isolate=Bar] [country=Canada]').
                              Overrides --organism / --isolate when set.
  --locus-tag-prefix TEXT     NCBI-registered locus tag prefix (required for
                              new-genome submissions).

Override options:
  -g, --genome PATH           Override auto-discovered genome FASTA.
  -i, --gff3 PATH             Override auto-discovered annotated GFF3.

Pipeline parameters:
  --cleanup TEXT              table2asn -c cleanup flags. [default: befw]
  --mode TEXT                 table2asn -M flatfile mode. [default: n]
  -a, --assembly-type TEXT    table2asn -a assembly type / gap config.
                              [default: r10k]
  --extra-args TEXT           Extra table2asn arguments, shell-quoted
                              (e.g. --extra-args '-split-dr -huge').
  --cleanup-gff3 / --no-cleanup-gff3
                              Pre-process the GFF3 (strip UniProt cruft, drop
                              CDS-less mRNAs, cap inferences) before handing
                              it to table2asn. [default: cleanup-gff3]

Output options:
  -o, --output-file PATH      Output .sqn path.
                              [default: <output-dir>/<genome-stem>.sqn]
  -d, --output-dir PATH       Output directory. [default: ./submission]
  --print-command             Print the resolved table2asn command and exit.
  --dry-run                   Print the command and create the output dir,
                              but don't run table2asn.

The .sbt submission template must be created via NCBI's web form — it is the only mandatory file input. Outputs land in ./submission/:

  • <genome-stem>.sqn — GenBank-format submission file ready for upload.
  • <genome-stem>.val — validator report (FATAL/ERROR/WARNING/INFO counts are logged on completion).
  • <genome-stem>.dr — discrepancy report.
  • <genome-stem>.stats — feature statistics.
  • <gff3-stem>.cleaned.gff3 — the post-cleanup GFF3 actually fed to table2asn (when --cleanup-gff3 is enabled).

By default the GFF3 is pre-processed before table2asn sees it: UniProt-style metadata (OS=...OX=...GN=...PE=...SV=...) is stripped from product= values (without this, table2asn rewrites every product to "hypothetical protein", losing all functional annotation), (Fragment) suffixes are removed, mRNAs with no CDS children are dropped (along with newly-orphaned genes), and inference= lists are capped at three accessions per feature. Pass --no-cleanup-gff3 to bypass.

The intended workflow is iterative:

# First pass — typical errors surface as ERROR/WARNING in the .val report.
eukan prep-submission -t template.sbt --organism "Genus species"

# Inspect submission/<genome-stem>.val and .dr, refine the GFF3, re-run.
# Use --print-command to inspect the exact table2asn invocation without running it:
eukan prep-submission -t template.sbt --organism "Genus species" --print-command

The command exits non-zero if table2asn reports any FATAL validation errors; the .val report path is included in the error message.

eukan gff3toseq

Extract protein or cDNA sequences from a GFF3 + genome.

Usage: eukan gff3toseq [OPTIONS]

Required input:
  -g, --genome PATH                Genome FASTA. [required]
  -i, --gff3 PATH                  GFF3 with gene models. [required]

Pipeline parameters:
  --output-format [protein|cdna]   Output sequence type. [default: protein]
  -c, --code INTEGER               NCBI genetic code table number. [default: 1]

Output options:
  -o, --output-file FILENAME       Write FASTA to this file ('-' for stdout).
                                   [default: stdout]

eukan db-fetch

Download reference databases. Pfam is always fetched; the homology DB to pair it with is chosen by --homology-db.

Usage: eukan db-fetch [OPTIONS]

Options:
  -o, --output-dir PATH          Directory to download into. [default: databases]
  --homology-db [uniprot|kofam]  Homology DB to fetch alongside Pfam.
                                 [default: uniprot]
  -f, --force                    Re-download even if databases are up to date.
  -d, --database [uniprot|pfam|kofam|ko_list]
                                 Specific database(s) to fetch. Overrides
                                 --homology-db when given.

Downloads and prepares:

  • Pfam-A.hmm — Pfam HMM profiles (decompressed and pressed for hmmscan). Always fetched.
  • uniprot_sprot.faa — UniProt-SwissProt protein sequences (--homology-db uniprot, default).
  • kofam_eukaryote.hmm + ko_list.tsv — KOfam profiles and per-KO bit-score thresholds (--homology-db kofam). The fetcher downloads KofamKOALA's profiles.tar.gz and ko_list.gz from https://www.genome.jp/ftp/db/kofam/, filters profiles via the shipped eukaryote.hal list (~16k of ~27k KOs), concatenates them into a single HMM file, and presses it. ko_list is the per-KO metadata table used by func-annot --homology-db kofam; see the eukan func-annot section for the KofamKOALA citation.

Each database has an entry in databases/.manifest.json (md5 + source URL + download date); re-running eukan db-fetch is a no-op when files match the manifest. Use -d <name> for a targeted refresh of a single file (e.g. -d pfam to refresh only Pfam without touching the homology DB).

eukan compare

Compare predicted gene models against a reference or previous annotation to assess annotation quality. Reports gene-level classification (exact, inexact, missing, merged, fragmented, novel), subfeature-level metrics (mRNA, CDS, intron), and overlap-based sensitivity/specificity/F1 scores. Repeat --predicted to compare several predictions against the same reference and tabulate, per gene-level classification, which subsets of predictions agreed.

Usage: eukan compare [OPTIONS]

Required input:
  -r, --reference PATH       Reference GFF3 file.
  -p, --predicted PATH       Predicted GFF3 file. Repeat to compare multiple
                             predictions against the same reference.

Pipeline parameters:
  -L, --label TEXT           Short label per --predicted (must appear once
                             per --predicted, or be omitted to use file
                             stems).

Output options:
  -o, --output-file PATH     Write per-feature details to a TSV file. In
                             multi-prediction mode a leading 'prediction'
                             column is prepended.

The classification system and metrics are further described in the paper referenced in Citation. Gene-level classifications:

  • exact: prediction coordinates match reference exactly.
  • inexact: prediction overlaps a single reference with boundary differences.
  • missing: reference gene with no overlapping prediction (false negative).
  • merged: prediction spans 2+ reference genes.
  • fragmented: 2+ predictions each cover a single reference gene.
  • novel: prediction with no reference overlap (possibly false positive).

For matched features, reports overlap-based sensitivity (overlap / reference length), specificity (overlap / prediction length), and F1 score. Boundary differences (5' and 3') are reported for inexact matches.

# Single prediction
eukan compare -r reference.gff3 -p predicted.gff3

# Single prediction with per-feature TSV
eukan compare -r reference.gff3 -p predicted.gff3 -o details.tsv

# Multiple predictions against the same reference
eukan compare -r reference.gff3 \
    -p eukan.gff3 -p braker.gff3 -p maker.gff3 \
    -L Eukan -L Braker -L Maker \
    -o details.tsv

Multi-prediction mode

When two or more --predicted inputs are supplied, the report adds a comparative summary on top of the per-prediction breakdowns:

  • F1 by level / prediction — count-based F1 at gene, mRNA, CDS, and intron level for each prediction, side by side.
  • Per-class powerset — for each gene-level classification (match = exact|inexact, missing, merged, fragmented), and for each reference gene, records the sorted tuple of prediction labels whose classification of that gene was that class, then tallies these tuples. Counts within a class sum to the number of reference genes; the (none) row counts genes no prediction assigned to that class. Full enumeration up to N=6 predictions; condensed (shared-by-all / shared-by-none / uniquely-per-pred) above.

--label defaults to each prediction's filename stem; if two predictions share a stem they are auto-numbered with a warning. Use --label to set explicit names; counts must match --predicted.

The N=1 path is unchanged from earlier releases — the comparative section is omitted entirely and the per-feature TSV keeps the legacy schema.

eukan check

Verify Python dependencies, external tools, and databases.

Usage: eukan check [OPTIONS]

Options:
  --for [annotate|assemble|mask-repeats|func-annot|db-fetch|prep-submission]
      Only check tools needed by these subcommands. If omitted, check all.
  --db-dir PATH   Database directory to check. [default: databases]

Checks Python dependencies, probes each external tool with a version/help command, and verifies database integrity. Exits 0 if all checks pass, 1 if any fail.

# Check everything
eukan check

# Check only what's needed for annotation
eukan check --for annotate

# Check multiple subcommands
eukan check --for annotate --for assemble

Example output:

Checked 14 external tools:

  12 tools OK:
    ✓ samtools                       samtools 1.20
    ✓ AUGUSTUS                        AUGUSTUS (3.5.0)
    ...

  2 tools MISSING or BROKEN:
    ✗ codingquarry                   `CodingQuarry` not found on PATH; env not set: $QUARRY_PATH
      used by: annotate
    ✗ fitild                         `fitild` not found on PATH
      used by: annotate
      hint: Build from source: git clone https://github.com/ogotoh/fitild ...

Pipeline Overview

The annotation pipeline (eukan annotate) runs the following steps:

  1. ORF finding: Identify ORFs in transcript assemblies (if provided). Uses the configured genetic code (-c/--code) so that alternative stop codons (e.g., code 6 where TAA/TAG encode glutamine) are handled correctly.
  2. GeneMark: Ab initio gene prediction (ES mode, or ET mode if RNA-seq intron hints are available with >= 150 introns). Passes --gcode for non-standard genetic codes (codes 6 and 26).
  3. Protein alignment: Spliced alignment via spaln (intron-rich genomes, > 25% introns/gene) or GenomeThreader (intron-poor). See Protein alignment modes.
  4. AUGUSTUS: Train species-specific parameters from concordant GeneMark/protein models, then predict genes using protein + RNA-seq hints. Non-canonical splice sites (e.g., AT-AC) are allowed automatically when supported by sufficient junction evidence from STAR; --splice-permissive lowers the evidence thresholds.
  5. SNAP: Train and predict (all kingdoms). CodingQuarry also runs for fungus/protist genomes.
  6. EVidenceModeler: Build weighted consensus gene models from all evidence sources.
  7. PASA UTRs: Add UTR annotations and model alternative splicing from the transcriptome database.
  8. Final output: Assign sequential locus tags and correct CDS phases. Non-overlapping transcript ORFs not captured by EVM are patched into the final model set.

Output: final.gff3 in the working directory.

Protein alignment modes

spaln protein alignment supports two modes for modeling intron structure:

Default (fitild): Builds an intron length distribution from GeneMark predictions and feeds it to spaln via the -yI parameter. This is the established approach and requires the fitild tool.

Species-specific parameters (--spsp): Uses transcript data to build full species-specific spaln parameters (splice site models, intron potential, codon usage) via spaln's make_eij.pl and make_ssp.pl scripts. This produces richer alignment parameters than fitild alone, at the cost of additional computation. Requires transcript data (from eukan assemble or provided via --transcripts-fasta).

# Default mode (fitild)
eukan annotate -g genome.fasta -p proteins.fasta

# Species-specific parameter mode (experimental)
eukan annotate -g genome.fasta -p proteins.fasta --spsp

When --spsp is used, protein alignment results are written to a separate directory (prot_align_ssp/) so both modes can coexist for comparison.

Run tracking and resume

All pipelines write to a shared eukan-run.json manifest in the working directory, tracking per-step status, timing, and output checksums. This enables:

  • Resume: Re-running a subcommand skips steps that already completed.
  • Selective re-run: Use --run-* flags to force specific steps to re-execute.
  • Integrity checking: On startup, completed steps are validated (output exists and is non-empty). If a discrepancy is found, the pipeline reports the issue and suggests the --run-* flag to fix it.
  • Progress monitoring: eukan status reads the manifest and shows progress across all pipelines.
# View run status
eukan status

# Re-run only protein alignment
eukan annotate -g genome.fasta -p proteins.fasta --run-prot-align

# Re-run only PASA in assembly
eukan assemble -g genome.fasta -l left.fq -r right.fq -P

Testing

Unit tests

poetry run pytest tests/ -v

Unit tests cover GFF3 parsing/serialization, genomic interval operations, ORF finding, configuration validation, run manifest tracking, database integrity checks, and CLI wiring. They run without external tools or network access.

Pipeline integration test

A development CLI at tests/run_pipeline.py drives a full end-to-end pipeline run using S. pombe chromosome III as the test organism.

Prerequisites

  • All external tools installed (Docker or conda environment; verify with eukan check)
  • NCBI datasets CLI and SRA Toolkit on PATH (for downloading test data)

When using Docker, build and use the dev image (eukan-dev) which includes the NCBI datasets CLI:

docker build -t eukan-dev -f docker/Dockerfile.dev .

1. Download test data

python tests/run_pipeline.py setup-test-data [-o tests/data]

Downloads from NCBI:

  • Genome: S. pombe chromosome III (NC_003424.3, ~2.5 Mbp)
  • Proteins: 10 close neighbor proteomes via datasets
  • RNA-seq: 5 SRA paired-end runs via prefetch + fasterq-dump

Accession lists live in tests/data/*.txt and are never deleted by cleanup. Downloads are idempotent.

2. Run the pipeline

# Full run: assembly + annotation (default kingdom: fungus)
python tests/run_pipeline.py test-pipeline --kingdom fungus -n 8

# Protein-only: skip transcriptome assembly
python tests/run_pipeline.py test-pipeline --protein-only -n 8

# Custom directories
python tests/run_pipeline.py test-pipeline -d tests/data -w tests/pipeline-run

The test pipeline runs:

  1. Transcriptome assembly: STAR read mapping, Trinity (genome-guided + de novo), PASA alignment
  2. Genome annotation: GeneMark, protein alignment (spaln/gth), AUGUSTUS, SNAP, EVM consensus

If assembly fails, it falls back to protein-only annotation automatically.

Output lands in tests/pipeline-run/ with one subdirectory per pipeline step (assemble/, annotate/, func-annot/, submission/). View run details with eukan status -d tests/pipeline-run/annotate. The end-of-run summary reports the gene/mRNA counts from final.gff3 plus the percentage of mRNAs that received a UniProt or Pfam inference.

3. Clean up

# Remove pipeline outputs only
python tests/run_pipeline.py clean-test-data

# Remove outputs + downloaded data (genome, proteins, reads)
python tests/run_pipeline.py clean-test-data --all

All three subcommands accept -h for detailed help.

Citation

If you use eukan, please cite:

Sarrasin M, Burger G, Lang BF. Eukan: a fully automated nuclear genome annotation pipeline for less studied and divergent eukaryotes. NAR Genomics and Bioinformatics. 2026 Mar;8(1):lqag003. doi:10.1093/nargab/lqag003

License

See LICENSE.md.

About

A eukaryotic genome annotation pipeline

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors