# Replication of Text2Mol: Cross-Modal Molecular Retrieval with Natural Language Queries
Create a new conda environment for the project:

```bash
# Create conda environment
conda env create -f code/requirements.yaml

# Activate environment
conda activate text2mol

# Update environment (if needed)
conda env update -f code/requirements.yaml --prune
```

Train the Text2Mol model:
```bash
python code/main.py --data data --output_path test_output --model MLP --epochs 40 --batch_size 32
```

Rank embeddings and evaluate performance:
```bash
# Rank single model outputs
python code/ranker.py test_output/embeddings --train --val --test

# Rank ensemble of multiple models
python code/ensemble.py test_output/embeddings GCN_outputs/embeddings --train --val --test
```

Run example queries with a trained model:
```bash
python code/test_example.py test_output/embeddings/ data/ test_output/CHECKPOINT.pt
```

The project uses the ChEBI-20 dataset located in the `data/` directory. The dataset includes:
- Training/Validation/Test splits: `training.txt`, `val.txt`, `test.txt`
- Molecular graphs: `mol_graphs.zip` containing graph representations
- Token embeddings: `token_embedding_dict.npy` for molecular substructure tokens
- Corpus data: `ChEBI_defintions_substructure_corpus.cp` with tokenized descriptions
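The token embedding dictionary is a NumPy `.npy` file holding a Python dict, so it must be loaded with `allow_pickle=True`. The sketch below uses a small stand-in dict (the token names and embedding size are hypothetical; the real keys depend on the dataset):

```python
import numpy as np

# Hypothetical stand-in for token_embedding_dict.npy: a dict mapping
# substructure-token strings to embedding vectors. NumPy stores a dict
# as a 0-d object array.
demo = {
    "tok_a": np.zeros(300, dtype=np.float32),
    "tok_b": np.ones(300, dtype=np.float32),
}
np.save("token_embedding_dict_demo.npy", demo)

# allow_pickle=True is required for object arrays; .item() unwraps the
# 0-d object array back into the original dict.
token_embeddings = np.load("token_embedding_dict_demo.npy",
                           allow_pickle=True).item()

print(len(token_embeddings))            # number of tokens
print(token_embeddings["tok_a"].shape)  # embedding dimensionality
```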
Each data file contains:
- CID: PubChem Compound ID
- Mol2Vec embeddings: Pre-computed molecular embeddings
- ChEBI descriptions: Natural language descriptions of molecules
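A split file can be read with pandas; the sketch below uses an in-memory two-row stand-in and assumes tab-separated columns holding the CID, a space-separated Mol2Vec embedding, and a description (the real column layout may differ):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for training.txt: tab-separated CID,
# Mol2Vec embedding (space-separated floats), and ChEBI description.
demo = io.StringIO(
    "CID\tmol2vec\tdescription\n"
    "12345\t0.1 0.2 0.3\tA small demo molecule.\n"
    "67890\t0.4 0.5 0.6\tAnother demo molecule.\n"
)
df = pd.read_csv(demo, sep="\t")

# Parse the embedding strings into numeric vectors for downstream use.
df["mol2vec"] = df["mol2vec"].apply(
    lambda s: np.array(s.split(), dtype=np.float32)
)

print(df.shape)
print(df["mol2vec"].iloc[0])
```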
The implementation includes three model variants:
| Model | Description |
|---|---|
| MLP | Multi-layer perceptron for embedding projection |
| GCN | Graph Convolutional Network for molecular representation |
| Attention | Attention-based model for cross-modal learning |
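All three variants project molecules and descriptions into a shared embedding space, so retrieval reduces to ranking molecules by cosine similarity to a text query. The NumPy sketch below illustrates that ranking, plus retrieval metrics of the kind `ranker.py` reports; the embeddings are synthetic stand-ins, not model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared-space embeddings: 5 molecules and their paired
# text descriptions (the text side is a noisy copy of the molecule side).
mol_emb = rng.normal(size=(5, 8))
text_emb = mol_emb + 0.1 * rng.normal(size=(5, 8))

def normalize(x):
    """L2-normalize each row so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity between every text query and every molecule.
sim = normalize(text_emb) @ normalize(mol_emb).T  # shape (queries, molecules)

# Rank of the true (paired) molecule for each query; 1 = retrieved first.
order = np.argsort(-sim, axis=1)
ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(sim))])

print("mean rank:", ranks.mean())
print("hits@1:", (ranks == 1).mean())
print("MRR:", (1.0 / ranks).mean())
```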
| File | Purpose |
|---|---|
| main.py | Main training script |
| models.py | Model architecture definitions |
| dataloaders.py | Data loading and preprocessing |
| losses.py | Loss function implementations |
| ranker.py | Embedding ranking and evaluation |
| ensemble.py | Ensemble model evaluation |
| extract_embeddings.py | Embedding extraction utilities |
| test_example.py | Interactive testing interface |
| ranker_threshold.py | Threshold analysis and visualization |
Extract embeddings from a trained checkpoint:

```bash
python code/extract_embeddings.py \
    --data data \
    --output_path embedding_output_dir \
    --checkpoint test_output/CHECKPOINT.pt \
    --model MLP \
    --batch_size 32
```

Run threshold analysis and save the resulting plot:

```bash
python code/ranker_threshold.py test_output/embeddings \
    --train --val --test \
    --output_file threshold_analysis.png
```

Watch the project presentation: YouTube Video
Key dependencies include:
- PyTorch 1.11.0: Deep learning framework
- PyTorch Geometric: Graph neural networks
- Transformers 4.15.0: Pre-trained language models
- NumPy, Pandas: Data manipulation
- Matplotlib: Visualization
- Scikit-learn: Machine learning utilities
If you use this implementation in your research, please cite the original paper:
```bibtex
@inproceedings{edwards2021text2mol,
  title={Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries},
  author={Edwards, Carl and Zhai, ChengXiang and Ji, Heng},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages={595--607},
  year={2021},
  url={https://aclanthology.org/2021.emnlp-main.47/}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.