If you are using GENERator for sequence generation, please ensure that the length of each input sequence is a multiple of 6. This can be achieved by either:
- Padding the sequence on the left with 'A' (left padding);
- Truncating the sequence from the left (left truncation).
This requirement arises because GENERator employs a 6-mer tokenizer. If the input sequence length is not a multiple of 6, the tokenizer will append an `<oov>` (out-of-vocabulary) token to the end of the token sequence. This can result in uninformative subsequent generations, such as repeated 'AAAAAA'.
We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
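As a minimal illustration, the length adjustment can be done in a few lines of Python (the helper below is our own sketch, not part of this repo):

```python
# Minimal sketch: bring a DNA sequence to a length that is a multiple of 6
# before tokenization. The helper name is illustrative, not part of the repo.
def align_to_6mer(seq: str, pad: bool = True) -> str:
    remainder = len(seq) % 6
    if remainder == 0:
        return seq
    if pad:
        # Left padding: prepend 'A's so the informative 3' end stays intact
        return "A" * (6 - remainder) + seq
    # Left truncation: drop the leading bases instead
    return seq[remainder:]

print(align_to_6mer("CGTACGTA"))             # 'AAAACGTACGTA' (length 12)
print(align_to_6mer("CGTACGTA", pad=False))  # 'TACGTA' (length 6)
```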
- 📑 [2026-01-31] The GENERator-v2 technical report is now available on bioRxiv!
- 🤗 [2025-11-05] The GENERator-v2 series is now available on HuggingFace!
- 📑 [2025-02-12] Our paper is now available on arXiv!
- 🤗 [2025-02-11] Our models `GENERator-eukaryote-1.2b-base` and `GENERator-eukaryote-3b-base` are now available on HuggingFace!
In this repository, we present GENERator, a collection of generative genomic foundation models utilizing the transformer decoder architecture, trained on expansive DNA datasets derived from the RefSeq database. Our evaluations demonstrate that the GENERator consistently achieves state-of-the-art performance across a wide spectrum of benchmarks, including Genomic Benchmarks, NT tasks, and our newly proposed Gener tasks.
Beyond benchmark performance, the GENERator adheres to the central dogma of molecular biology, accurately generating protein-coding DNA sequences that produce proteins structurally analogous to known families. Moreover, the GENERator showcases significant promise in sequence optimization, particularly in the design of enhancer sequences that regulate gene expression during various biological stages, highlighting its potential for a series of biologically significant tasks. Our findings position the GENERator as a vital resource for genomic research and biotechnological advancement. By enhancing our capability to interpret and predict genomic sequences, the GENERator paves the way for profound improvements in our understanding of complex biological systems and the development of precise genomic interventions.
For more technical details, please refer to our paper on arXiv.
In this repository, you will find the following model checkpoints:
| Model Name | Parameters | Training Data (bp) | Category | Status |
|---|---|---|---|---|
| `GENERator-eukaryote-1.2b-base` | 1.2B | 386B | Eukaryote | Available |
| `GENERator-eukaryote-3b-base` | 3B | 386B | Eukaryote | Available |
| `GENERator-v2-eukaryote-1.2b-base` | 1.2B | 422B | Eukaryote | Available |
| `GENERator-v2-eukaryote-3b-base` | 3B | 422B | Eukaryote | Available |
| `GENERator-v2-prokaryote-1.2b-base` | 1.2B | 515B | Prokaryote | Available |
| `GENERator-v2-prokaryote-3b-base` | 3B | 515B | Prokaryote | Available |
- Clone this repo and `cd` into it:

```bash
git clone https://github.com/GenerTeam/GENERator.git
cd GENERator
```

- Install requirements with Python 3.10:

```bash
pip install -r requirements.txt
```

- To support much longer sequence lengths during pre-training, we recommend installing both Liger Kernel and FlashAttention. Remarkably, using sliding-window attention on just one A100 GPU, we extended the base-pair (bp) context length to 1 million.

```bash
pip install liger-kernel
pip install flash-attn --no-build-isolation
```

If your network cannot access huggingface.co normally, we recommend using the following mirror:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```
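For reference, here is a rough sketch of how FlashAttention can be enabled when loading a checkpoint through the standard `transformers` API (the arguments shown are stock Hugging Face ones; the repo's own scripts may configure this differently):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "GenerTeam/GENERator-eukaryote-1.2b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # BF16 halves memory vs FP32
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
)
```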
To run the variant effect prediction task on ClinVar, you can use the following commands:
```bash
# FP32 (default)
python src/tasks/downstream/variant_effect_prediction.py

# BF16 for faster inference (recommended if supported)
python src/tasks/downstream/variant_effect_prediction.py --bf16
```

Note: BF16 provides faster inference with minimal accuracy impact on supported hardware.
This task (previously called Next Kmer Prediction) evaluates the model's ability to recover DNA sequences.
To run sequence recovery on our datasets, you can use the following commands:

```bash
# FP32 (default)
python src/tasks/downstream/sequence_recovery.py

# BF16 for faster inference (recommended if supported)
python src/tasks/downstream/sequence_recovery.py --bf16
```

Note: BF16 provides faster inference with minimal accuracy impact on supported hardware.
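To make the setup concrete, here is a rough sketch of what recovery looks like at inference time (our own illustration, not the evaluation script itself, and assuming the checkpoint works with the standard `generate` API):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "GenerTeam/GENERator-eukaryote-1.2b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Prompt with a prefix whose length is a multiple of 6 (see the note above),
# then greedily generate a continuation to compare against the ground truth.
prefix = "ATGGCCATTGTAATGGGCCGCTGA" * 3  # 72 bp
inputs = tokenizer(prefix, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```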
To run the sequence understanding task on Gener Tasks, NT Tasks, Genomic Benchmarks, or DeepSTARR Enhancer Activity, you can use the following arguments:
- Gener Tasks: `--dataset_name GenerTeam/gener-tasks` with `--subset_name gene_classification` or `--subset_name taxonomic_classification`
- NT Tasks: `--dataset_name InstaDeepAI/nucleotide_transformer_downstream_tasks_revised` with `--subset_name H2AFZ`, `--subset_name H3K27ac`, ...
- Genomic Benchmarks: `--dataset_name katarinagresova/Genomic_Benchmarks_demo_human_or_worm`, `--dataset_name katarinagresova/Genomic_Benchmarks_human_ocr_ensembl`, ...
- DeepSTARR Enhancer Activity: `--dataset_name GenerTeam/DeepSTARR-enhancer-activity` with `--problem_type regression`
Pass these arguments to one of the following commands:
```bash
# Using single GPU
python src/tasks/downstream/sequence_understanding.py \
    --model_name GenerTeam/GENERator-v2-eukaryote-1.2b-base \
    --dataset_name ${DATASET_NAME} \
    --subset_name ${SUBSET_NAME} \
    --batch_size ${BATCH_SIZE} \
    --problem_type ${PROBLEM_TYPE} \
    --main_metrics ${MAIN_METRICS}
```
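For instance, a filled-in single-GPU invocation for the Gener Tasks gene classification subset might look like this (the batch size is illustrative; pick one that fits your hardware):

```bash
python src/tasks/downstream/sequence_understanding.py \
    --model_name GenerTeam/GENERator-v2-eukaryote-1.2b-base \
    --dataset_name GenerTeam/gener-tasks \
    --subset_name gene_classification \
    --batch_size 32
```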
```bash
# Using multiple GPUs on single node (DDP)
torchrun --nnodes=1 \
    --nproc_per_node=${NUM_GPUS} \
    --rdzv_backend=c10d \
    src/tasks/downstream/sequence_understanding.py \
    --dataset_name ${DATASET_NAME}

# Using multiple GPUs on multiple nodes (DDP)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/sequence_understanding.py \
    --dataset_name ${DATASET_NAME}
```
```bash
# Using DeepSpeed or Fully Sharded Data Parallel (FSDP)
# (set --distributed_type to deepspeed or fsdp)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/sequence_understanding.py \
    --distributed_type deepspeed \
    --dataset_name ${DATASET_NAME}
```

To fine-tune a GENERator base model for DNA sequence generation (e.g., DeepSTARR Enhancer, Histone coding DNA sequence (CDS), Cytochrome P450 CDS), you can use the following arguments:
Data input (choose exactly one):
- HuggingFace datasets
  - DeepSTARR Enhancer Activity: `--dataset_name GenerTeam/DeepSTARR-enhancer-activity`
  - Histone CDS: `--dataset_name GenerTeam/histone-cds`
  - Cytochrome P450 CDS: `--dataset_name GenerTeam/cytochrome-p450-cds`
  - Optional: `--subset_name ${SUBSET_NAME}`, `--dataset_split train` (default), `--sequence_col sequence` (default)
- Local parquet
  - `--parquet_path ${PARQUET_PATH}` (file or directory of parquet files)
  - Optional: `--sequence_col sequence` (default)
Pass these arguments to one of the following commands:
```bash
# Using single GPU
python src/tasks/downstream/fine_tuning.py \
    --model_name GenerTeam/GENERator-eukaryote-1.2b-base \
    --dataset_name ${DATASET_NAME} \
    --epochs ${NUM_EPOCHS} \
    --batch_size ${BATCH_SIZE}
```
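For example, a filled-in single-GPU run on the Histone CDS dataset might look like this (epoch count and batch size are illustrative):

```bash
python src/tasks/downstream/fine_tuning.py \
    --model_name GenerTeam/GENERator-eukaryote-1.2b-base \
    --dataset_name GenerTeam/histone-cds \
    --epochs 3 \
    --batch_size 16
```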
```bash
# Using multiple GPUs on single node
# (set --distributed_type to fsdp, deepspeed, or ddp)
torchrun --nnodes=1 \
    --nproc_per_node=${NUM_GPUS} \
    --rdzv_backend=c10d \
    src/tasks/downstream/fine_tuning.py \
    --distributed_type fsdp \
    --dataset_name ${DATASET_NAME}

# Using multiple GPUs on multiple nodes
# (set --distributed_type to fsdp, deepspeed, or ddp)
torchrun --nnodes=${NUM_NODES} \
    --nproc_per_node=${NUM_GPUS_PER_NODE} \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    src/tasks/downstream/fine_tuning.py \
    --distributed_type fsdp \
    --dataset_name ${DATASET_NAME}
```

Note: Replace `--dataset_name ${DATASET_NAME}` with `--parquet_path ${PARQUET_PATH}` for local parquet files.
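For instance, a local-parquet run might look like this (the path is a placeholder for your own data):

```bash
torchrun --nnodes=1 \
    --nproc_per_node=${NUM_GPUS} \
    --rdzv_backend=c10d \
    src/tasks/downstream/fine_tuning.py \
    --distributed_type ddp \
    --parquet_path /path/to/my_sequences/ \
    --sequence_col sequence
```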
```bibtex
@article{li2026generator2,
  author = {Li, Qiuyi and Zhan, Zhihao and Feng, Shikun and Zhu, Yiheng and He, Yuan and Wu, Wei and Shi, Zhenghang and Wang, Shengjie and Hu, Zongyong and Yang, Zhao and Li, Jiaoyang and Tang, Jian and Liu, Haiguang and Qin, Tao},
  title = {Functional In-Context Learning in Genomic Language Models with Nucleotide-Level Supervision and Genome Compression},
  elocation-id = {2026.01.27.702015},
  year = {2026},
  doi = {10.64898/2026.01.27.702015},
  publisher = {Cold Spring Harbor Laboratory},
  url = {https://www.biorxiv.org/content/early/2026/01/29/2026.01.27.702015},
  journal = {bioRxiv}
}

@article{wu2025generator,
  title = {GENERator: a long-context generative genomic foundation model},
  author = {Wu, Wei and Li, Qiuyi and Li, Mingyang and Fu, Kun and Feng, Fuli and Ye, Jieping and Xiong, Hui and Wang, Zheng},
  journal = {arXiv preprint arXiv:2502.07272},
  year = {2025}
}
```