GENERanno: A Genomic Foundation Model for Metagenomic Annotation

📰 News

🤗 [2026-02-10] Our expert model for eukaryotic genome annotation, `GENERanno-eukaryote-1.2b-cds-annotator-preview`, is now available on HuggingFace!
📑 [2025-06-05] Our paper is now available on bioRxiv!
🤗 [2025-05-10] Our expert model for metagenomic annotation, `GENERanno-prokaryote-0.5b-cds-annotator`, is now available on HuggingFace!
🤗 [2025-02-11] Our models `GENERanno-prokaryote-0.5b-base` and `GENERanno-eukaryote-0.5b-base` are now available on HuggingFace!

🔭 Overview

In this repository, we present GENERanno, a genomic foundation model with a context length of 8k base pairs and 500M parameters, trained on 386 billion base pairs of eukaryotic DNA. In our evaluations, GENERanno performs comparably to GENERator on Genomic Benchmarks, NT tasks, and our newly proposed Gener tasks, making the two the top-performing genomic foundation models in the field as of 2025-02.

Beyond benchmark performance, GENERanno is designed specifically for gene annotation. The model identifies gene locations, predicts gene function, and annotates gene structure efficiently and accurately, which can improve both the precision and the efficiency of gene annotation workflows.

Please note that GENERanno is still under development. We are actively refining the model and will release more technical details soon. Stay tuned for updates!

In this repository, you will find the following model checkpoints:

| Model Name | Parameters | Data | Category | Status |
| --- | --- | --- | --- | --- |
| `GENERanno-prokaryote-0.5b-base` | 0.5B | 715B | Prokaryote | Available |
| `GENERanno-prokaryote-0.5b-cds-annotator` | 0.5B | 890B | Prokaryote | Available |
| `GENERanno-eukaryote-0.5b-base` | 0.5B | 386B | Eukaryote | Available |
| `GENERanno-eukaryote-1.2b-cds-annotator-preview` | 1.2B | 1T | Eukaryote | Available |

🎯 Quick Start

Dependencies

Clone this repo and cd into it:

git clone https://github.com/GenerTeam/GENERanno.git
cd GENERanno

Install requirements with Python 3.10

pip install -r requirements.txt

If your network cannot access huggingface.co normally, we recommend using the following mirror:
export HF_ENDPOINT=https://hf-mirror.com
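
If you launch downloads from a Python session rather than a shell, the same mirror can be selected by setting the variable in-process. The sketch below assumes the Hugging Face libraries read `HF_ENDPOINT` when they are first imported, so the helper has to run before those imports; `use_hf_mirror` is a hypothetical name used only for illustration.

```python
import os


def use_hf_mirror(endpoint: str = "https://hf-mirror.com") -> str:
    """Point Hugging Face downloads at a mirror unless one is already configured."""
    # setdefault keeps any endpoint that was already exported in the shell.
    return os.environ.setdefault("HF_ENDPOINT", endpoint)


# Call this before importing huggingface_hub, transformers, or datasets.
active = use_hf_mirror()
print(f"HF_ENDPOINT={active}")
```

Using `setdefault` rather than a plain assignment means an endpoint exported in the shell takes precedence over the default in the script.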

Downstream

Coding DNA Sequence (CDS) Annotation

You can run CDS annotation on the cds-annotation dataset using the unified CLI interface below.

Basic usage

# Eukaryotic genome annotation
python src/tasks/downstream/cds_annotation.py --organism eukaryote

# Prokaryotic genome annotation
python src/tasks/downstream/cds_annotation.py --organism prokaryote

# Enable BF16 for faster inference (recommended if supported)
python src/tasks/downstream/cds_annotation.py --organism eukaryote --bf16

Custom input

By default, each --organism preset uses a built-in example input. You can override it with your own FASTA or Parquet file:

# Parquet input
python src/tasks/downstream/cds_annotation.py \
  --organism eukaryote \
  --input hf://datasets/GenerTeam/cds-annotation/examples/fly_GCF_000001215.4.parquet

# FASTA input
python src/tasks/downstream/cds_annotation.py \
  --organism prokaryote \
  --input hf://datasets/GenerTeam/cds-annotation/examples/Escherichia_coli_genome.fasta
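
If your sequences live in memory rather than in a file, a small helper can write them out as FASTA first. This is a minimal sketch, not part of the repository: `write_fasta` and `my_genome.fasta` are hypothetical names, the A/C/G/T/N alphabet check is a conservative choice of this sketch, and it is an assumption that `--input` accepts a local path in addition to the `hf://` URIs shown above.

```python
from pathlib import Path
from typing import Iterable, Tuple


def write_fasta(records: Iterable[Tuple[str, str]], path: str, width: int = 60) -> Path:
    """Write (identifier, sequence) pairs to a FASTA file with wrapped sequence lines."""
    out = Path(path)
    with out.open("w") as handle:
        for name, seq in records:
            seq = seq.upper()
            # Reject anything outside the nucleotide alphabet assumed by this sketch.
            invalid = set(seq) - set("ACGTN")
            if invalid:
                raise ValueError(f"{name}: unexpected characters {sorted(invalid)}")
            handle.write(f">{name}\n")
            # Wrap the sequence at `width` characters per line.
            for start in range(0, len(seq), width):
                handle.write(seq[start:start + width] + "\n")
    return out


# Toy 90 bp record, for illustration only.
fasta_path = write_fasta([("contig_1", "ATGGCGTAA" * 10)], "my_genome.fasta")
print(fasta_path)
```

The resulting file could then be passed as `--input my_genome.fasta`, under the local-path assumption stated above.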

Performance options

# Use all available GPUs (default)
python src/tasks/downstream/cds_annotation.py --organism eukaryote

# Use a specific number of GPUs
python src/tasks/downstream/cds_annotation.py --organism eukaryote --gpu_count ${NUM_GPUS}

# Enable BF16 for faster inference (recommended if supported)
python src/tasks/downstream/cds_annotation.py --organism eukaryote --bf16

Note: BF16 improves inference speed on supported hardware (e.g. A100) with minimal impact on accuracy.
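
When annotation runs are scripted, it can be convenient to assemble the command line programmatically. The sketch below only combines the flags documented above (`--organism`, `--input`, `--gpu_count`, `--bf16`); `build_cds_command` is a hypothetical helper, not something the repository provides, and the input validation reflects only the two presets mentioned in this README.

```python
import shlex
from typing import List, Optional

CDS_SCRIPT = "src/tasks/downstream/cds_annotation.py"


def build_cds_command(
    organism: str,
    input_path: Optional[str] = None,
    gpu_count: Optional[int] = None,
    bf16: bool = False,
) -> List[str]:
    """Assemble the CDS annotation command from the documented flags."""
    if organism not in ("eukaryote", "prokaryote"):
        raise ValueError(f"unknown organism preset: {organism!r}")
    cmd = ["python", CDS_SCRIPT, "--organism", organism]
    if input_path is not None:
        cmd += ["--input", input_path]
    if gpu_count is not None:
        if gpu_count < 1:
            raise ValueError("gpu_count must be at least 1")
        cmd += ["--gpu_count", str(gpu_count)]
    if bf16:
        cmd.append("--bf16")
    return cmd


cmd = build_cds_command(
    "prokaryote",
    input_path="hf://datasets/GenerTeam/cds-annotation/examples/Escherichia_coli_genome.fasta",
    gpu_count=2,
    bf16=True,
)
print(shlex.join(cmd))
# To execute from the repository root: subprocess.run(cmd, check=True)
```

Returning a list of arguments instead of a single string avoids shell-quoting problems when the command is handed to `subprocess.run`.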

Sequence Understanding (Classification/Regression)

To run the sequence understanding task on Gener Tasks, Prokaryotic Gener Tasks, NT Tasks, Genomic Benchmarks, or DeepSTARR Enhancer Activity, use the following arguments:

- Gener Tasks / Prokaryotic Gener Tasks
  - `--dataset_name GenerTeam/gener-tasks` or `--dataset_name GenerTeam/prokaryotic-gener-tasks`
  - `--subset_name gene_classification` or `--subset_name taxonomic_classification` or ...
- NT Tasks
  - `--dataset_name InstaDeepAI/nucleotide_transformer_downstream_tasks_revised`
  - `--subset_name H2AFZ` or `--subset_name H3K27ac` or ...
- Genomic Benchmarks
  - `--dataset_name katarinagresova/Genomic_Benchmarks_demo_human_or_worm` or `--dataset_name katarinagresova/Genomic_Benchmarks_human_ocr_ensembl` or ...
- DeepSTARR Enhancer Activity
  - `--dataset_name GenerTeam/DeepSTARR-enhancer-activity`
  - `--problem_type regression`

Then run one of the following commands:

# Using single GPU
python src/tasks/downstream/sequence_understanding.py \
    --model_name GenerTeam/GENERanno-eukaryote-0.5b-base \
    --dataset_name ${DATASET_NAME} \
    --subset_name ${SUBSET_NAME} \
    --batch_size ${BATCH_SIZE} \
    --problem_type ${PROBLEM_TYPE} \
    --main_metrics ${MAIN_METRICS}

# Using multiple GPUs on single node (DDP)
torchrun --nnodes=1 \
    --nproc_per_node=${NUM_GPUS} \
    --rdzv_backend=c10d \
    src/tasks/downstream/sequence_understanding.py

# Using multiple GPUs on multiple nodes (DDP)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/sequence_understanding.py

# Using DeepSpeed or Fully Sharded Data Parallel (FSDP)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/sequence_understanding.py \
    --distributed_type deepspeed # or fsdp
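
To sweep over several datasets, the launch commands above can be generated from a dictionary of task arguments. This is a rough sketch under stated assumptions: `build_su_command` is a hypothetical helper, it assumes the script accepts the same task flags when started through `torchrun`, and it covers only the single-GPU and single-node cases. The model name is one of the checkpoints listed in the table above, and the DeepSTARR arguments are the ones given in the argument list.

```python
import shlex
from typing import Dict, List, Optional

SU_SCRIPT = "src/tasks/downstream/sequence_understanding.py"


def build_su_command(task_args: Dict[str, Optional[object]], num_gpus: int = 1) -> List[str]:
    """Return a single-GPU `python` command or a single-node `torchrun` command."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        cmd = ["python", SU_SCRIPT]
    else:
        cmd = [
            "torchrun",
            "--nnodes=1",
            f"--nproc_per_node={num_gpus}",
            "--rdzv_backend=c10d",
            SU_SCRIPT,
        ]
    # Append each task argument as `--flag value`; None means "leave the flag out".
    for flag, value in task_args.items():
        if value is not None:
            cmd += [f"--{flag}", str(value)]
    return cmd


deepstarr_args = {
    "model_name": "GenerTeam/GENERanno-eukaryote-0.5b-base",
    "dataset_name": "GenerTeam/DeepSTARR-enhancer-activity",
    "subset_name": None,  # not listed for DeepSTARR, so the flag is omitted
    "problem_type": "regression",
}

print(shlex.join(build_su_command(deepstarr_args)))
print(shlex.join(build_su_command(deepstarr_args, num_gpus=4)))
```

Multi-node runs would additionally need `--rdzv_endpoint`, as in the commands above; that case is left out of the sketch to keep it short.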

📚 Datasets

📜 Citation

@article{li2025generanno,
	author = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
	title = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
	elocation-id = {2025.06.04.656517},
	year = {2025},
	doi = {10.1101/2025.06.04.656517},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
	journal = {bioRxiv}
}
