GENERanno: A Genomic Foundation Model for Metagenomic Annotation

📰 News

  • 🤗 [2026-02-10] Our expert model for eukaryotic genome annotation GENERanno-eukaryote-1.2b-cds-annotator-preview is now available on HuggingFace!
  • 📑 [2025-06-05] Our paper is now available on bioRxiv!
  • 🤗 [2025-05-10] Our expert model for metagenomic annotation GENERanno-prokaryote-0.5b-cds-annotator is now available on HuggingFace!
  • 🤗 [2025-02-11] Our models GENERanno-prokaryote-0.5b-base and GENERanno-eukaryote-0.5b-base are now available on HuggingFace!

🔭 Overview

In this repository, we present GENERanno, a genomic foundation model with a context length of 8k base pairs and 500M parameters, trained on 386 billion base pairs of eukaryotic DNA. In our evaluations, GENERanno performs comparably to GENERator on Genomic Benchmarks, NT tasks, and our newly proposed Gener tasks, which makes both models the top genomic foundation models in the field as of 2025-02.

Beyond benchmark performance, GENERanno is designed specifically for gene annotation. The model identifies gene locations, predicts gene function, and annotates gene structure efficiently and accurately, which could substantially improve the precision and efficiency of gene annotation in genomic research.

Please note that GENERanno is still under development. We are actively refining the model and will release more technical details soon. Stay tuned for updates!

In this repository, you will find the following model checkpoints:

| Model Name | Parameters | Data (bp) | Category | Status |
|---|---|---|---|---|
| GENERanno-prokaryote-0.5b-base | 0.5B | 715B | Prokaryote | Available |
| GENERanno-prokaryote-0.5b-cds-annotator | 0.5B | 890B | Prokaryote | Available |
| GENERanno-eukaryote-0.5b-base | 0.5B | 386B | Eukaryote | Available |
| GENERanno-eukaryote-1.2b-cds-annotator-preview | 1.2B | 1T | Eukaryote | Available |

🎯 Quick Start

Dependencies

  • Clone this repo, cd into it
git clone https://github.com/GenerTeam/GENERanno.git
cd GENERanno
  • Install requirements with Python 3.10
pip install -r requirements.txt

If your network cannot access huggingface.co normally, we recommend using the following mirror:

export HF_ENDPOINT=https://hf-mirror.com
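
If downloads are triggered from a Python script rather than from the shell, the same mirror can be selected in code. The sketch below assumes the variable must be set before `huggingface_hub` or `transformers` is imported, since `huggingface_hub` reads `HF_ENDPOINT` at import time. The helper name `use_hf_mirror` is hypothetical and not part of this repository.

```python
import os

HF_MIRROR = "https://hf-mirror.com"  # mirror URL recommended above


def use_hf_mirror(endpoint: str = HF_MIRROR) -> str:
    """Set HF_ENDPOINT unless it is already exported; return the value in effect.

    Hypothetical helper: it only touches the process environment.
    """
    return os.environ.setdefault("HF_ENDPOINT", endpoint)


if __name__ == "__main__":
    # Call this before importing huggingface_hub / transformers.
    print(use_hf_mirror())
```

Using `setdefault` keeps any endpoint the user has already exported, so the script does not silently override a deliberate choice.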

Downstream

Coding DNA Sequence (CDS) Annotation

You can run CDS annotation on the cds-annotation dataset using the unified CLI interface below.

Basic usage

# Eukaryotic genome annotation
python src/tasks/downstream/cds_annotation.py --organism eukaryote

# Prokaryotic genome annotation
python src/tasks/downstream/cds_annotation.py --organism prokaryote

# Enable BF16 for faster inference (recommended if supported)
python src/tasks/downstream/cds_annotation.py --organism eukaryote --bf16

Custom input

By default, each --organism preset uses a built-in example input. You can override it with your own FASTA or Parquet file:

# Parquet input
python src/tasks/downstream/cds_annotation.py \
  --organism eukaryote \
  --input hf://datasets/GenerTeam/cds-annotation/examples/fly_GCF_000001215.4.parquet

# FASTA input
python src/tasks/downstream/cds_annotation.py \
  --organism prokaryote \
  --input hf://datasets/GenerTeam/cds-annotation/examples/Escherichia_coli_genome.fasta
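
Before passing your own FASTA file to `--input`, it can help to check that it parses cleanly. The following is a minimal sketch using only the standard library. The helper names `read_fasta` and `unexpected_symbols` are hypothetical, and the default alphabet `ACGTN` is an assumption rather than a documented requirement of the annotator.

```python
from typing import Dict, Iterable, List, Set


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {record_id: sequence}.

    The record id is the first whitespace-delimited word after '>'.
    Sequence lines are upper-cased and concatenated.
    """
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            current = header[0]
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def unexpected_symbols(seq: str, alphabet: str = "ACGTN") -> Set[str]:
    """Return the characters in seq that are outside the assumed alphabet."""
    return set(seq) - set(alphabet)


if __name__ == "__main__":
    example = [">chr1 demo", "ACGTN", "acgt", ">chr2", "ACGX"]
    for name, seq in read_fasta(example).items():
        print(name, len(seq), sorted(unexpected_symbols(seq)))
```

For a real file, pass an open file handle instead of the example list, for instance `read_fasta(open("genome.fasta"))`.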

Performance options
# Use all available GPUs (default)
python src/tasks/downstream/cds_annotation.py --organism eukaryote

# Use a specific number of GPUs
python src/tasks/downstream/cds_annotation.py --organism eukaryote --gpu_count ${NUM_GPUS}

# Enable BF16 for faster inference (recommended if supported)
python src/tasks/downstream/cds_annotation.py --organism eukaryote --bf16

Note: BF16 improves inference speed on supported hardware (e.g. A100) with minimal impact on accuracy.
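
Whether to pass `--bf16` can also be decided programmatically. The sketch below checks BF16 support through PyTorch's `torch.cuda.is_bf16_supported()` and builds the annotation command as an argument list. The function names are hypothetical, and the check assumes inference runs on a CUDA device.

```python
import importlib.util
from typing import List


def bf16_supported() -> bool:
    """True only if PyTorch is installed, sees a CUDA device, and reports BF16 support."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the module loads without PyTorch

    return bool(torch.cuda.is_available() and torch.cuda.is_bf16_supported())


def annotation_command(organism: str, use_bf16: bool) -> List[str]:
    """Build the CDS annotation command, adding --bf16 only when requested."""
    cmd = ["python", "src/tasks/downstream/cds_annotation.py", "--organism", organism]
    if use_bf16:
        cmd.append("--bf16")
    return cmd


if __name__ == "__main__":
    print(" ".join(annotation_command("eukaryote", bf16_supported())))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root.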

Sequence Understanding (Classification/Regression)

To run the sequence understanding task on Gener Tasks, Prokaryotic Gener Tasks, NT Tasks, Genomic Benchmarks, or DeepSTARR Enhancer Activity, use the following arguments:

  • Gener Tasks / Prokaryotic Gener Tasks
    • --dataset_name GenerTeam/gener-tasks or --dataset_name GenerTeam/prokaryotic-gener-tasks
    • --subset_name gene_classification or --subset_name taxonomic_classification or ...
  • NT Tasks
    • --dataset_name InstaDeepAI/nucleotide_transformer_downstream_tasks_revised
    • --subset_name H2AFZ or --subset_name H3K27ac or ...
  • Genomic Benchmarks
    • --dataset_name katarinagresova/Genomic_Benchmarks_demo_human_or_worm or --dataset_name katarinagresova/Genomic_Benchmarks_human_ocr_ensembl or ...
  • DeepSTARR Enhancer Activity
    • --dataset_name GenerTeam/DeepSTARR-enhancer-activity
    • --problem_type regression

Pass these arguments to the following command:

# Using single GPU
python src/tasks/downstream/sequence_understanding.py \
    --model_name GenerTeam/GENERator-eukaryote-1.2b-base \
    --dataset_name ${DATASET_NAME} \
    --subset_name ${SUBSET_NAME} \
    --batch_size ${BATCH_SIZE} \
    --problem_type ${PROBLEM_TYPE} \
    --main_metrics ${MAIN_METRICS}

# Using multiple GPUs on single node (DDP)
torchrun --nnodes=1 \
    --nproc_per_node=${NUM_GPUS} \
    --rdzv_backend=c10d \
    src/tasks/downstream/sequence_understanding.py

# Using multiple GPUs on multiple nodes (DDP)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/sequence_understanding.py

# Using DeepSpeed or Fully Sharded Data Parallel (FSDP)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/sequence_understanding.py \
    --distributed_type deepspeed # or fsdp
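
The launch variants above differ only in the launcher prefix and the optional `--distributed_type` flag, so they can be generated from a few parameters. The sketch below mirrors those commands. The function `launch_command` is hypothetical, and task flags such as `--dataset_name` are assumed to be appended by the caller.

```python
import shlex
from typing import List, Optional

SCRIPT = "src/tasks/downstream/sequence_understanding.py"


def launch_command(
    num_nodes: int = 1,
    gpus_per_node: int = 1,
    rdzv_endpoint: Optional[str] = None,
    distributed_type: Optional[str] = None,
) -> List[str]:
    """Build the launcher prefix for single-GPU, single-node, or multi-node runs."""
    if distributed_type not in (None, "deepspeed", "fsdp"):
        raise ValueError("distributed_type must be 'deepspeed', 'fsdp', or None")
    if num_nodes == 1 and gpus_per_node == 1:
        cmd = ["python", SCRIPT]  # plain single-GPU run
    else:
        cmd = [
            "torchrun",
            f"--nnodes={num_nodes}",
            f"--nproc_per_node={gpus_per_node}",
            "--rdzv_backend=c10d",
        ]
        if num_nodes > 1:
            # Multi-node runs need a rendezvous address, given as host:port.
            if rdzv_endpoint is None:
                raise ValueError("rdzv_endpoint is required when num_nodes > 1")
            cmd.append(f"--rdzv_endpoint={rdzv_endpoint}")
        cmd.append(SCRIPT)
    if distributed_type is not None:
        cmd += ["--distributed_type", distributed_type]
    return cmd


if __name__ == "__main__":
    print(shlex.join(launch_command(gpus_per_node=4)))
```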

📚 Datasets

📜 Citation

@article{li2025generanno,
	author = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
	title = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
	elocation-id = {2025.06.04.656517},
	year = {2025},
	doi = {10.1101/2025.06.04.656517},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
	journal = {bioRxiv}
}

📈 Benchmark Performance

Sequence Understanding (Classification/Regression) — GENERanno-prokaryote-0.5b-base

[Figure: benchmarks]

Sequence Understanding (Classification/Regression) — GENERanno-eukaryote-0.5b-base

[Figure: benchmarks]
