This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
KG-Microbe is a knowledge graph construction project for microbial traits and beyond. It integrates multiple data sources (BacDive, MediaDive, UniProt, CTD, etc.) with ontologies (NCBITaxon, ChEBI, GO, ENVO, etc.) to create a comprehensive knowledge graph of microbial organisms, their traits, growth media, metabolic pathways, and associated chemical compounds.
The project follows a three-stage pipeline: Download → Transform → Merge
```bash
pip install poetry
poetry install

# Download all data sources (configured in download.yaml)
poetry run kg download

# Transform downloaded data into KG format (TSV: nodes.tsv, edges.tsv)
poetry run kg transform

# Transform specific sources only
poetry run kg transform -s bacdive -s mediadive

# Merge all transformed graphs (configured in merge.yaml or merge.minimal.yaml)
poetry run kg merge -y merge.yaml

# Run all quality checks before committing (REQUIRED before every commit)
poetry run tox

# Run specific test suites
poetry run pytest                     # Run all tests
poetry run pytest tests/test_file.py  # Run specific test file

# Individual quality checks
poetry run tox -e format           # Format code (black + ruff --fix)
poetry run tox -e lint             # Check code style
poetry run tox -e codespell-write  # Fix spelling errors
poetry run tox -e docstr-coverage  # Check documentation coverage

# Generate node and edge counts by category
make run-summary

# Generate holdout sets for ML training (splits graph into train/test/validation)
poetry run kg holdouts -n data/merged/nodes.tsv -e data/merged/edges.tsv -o data/holdouts/

# Run SPARQL queries against the knowledge graph
poetry run kg query -y queries/sparql/example_query.yaml -o data/queries/

# Upload merged KG to local Neo4j instance
make neo4j-upload
```
- **Download** (`kg_microbe/download.py`)
  - Downloads resources defined in `download.yaml`
  - Sources stored in `data/raw/`
  - Uses the `kghub-downloader` library
- **Transform** (`kg_microbe/transform.py`)
  - Each data source has its own transform class in `kg_microbe/transform_utils/[source_name]/`
  - All transform classes inherit from the `Transform` base class (`transform_utils/transform.py`)
  - Each transform produces `nodes.tsv` and `edges.tsv` in `data/transformed/[source_name]/`
  - Node/edge headers are defined in the base `Transform` class using constants from `constants.py`
- **Merge** (`kg_microbe/merge_utils/merge_kg.py`)
  - Uses the KGX library to merge transformed graphs
  - Configuration in `merge.yaml` or `merge.minimal.yaml`
  - Outputs to `data/merged/` as TSV (optionally tar.gz compressed)
  - Generates graph statistics in `merged_graph_stats.yaml`
All transform classes follow this pattern:
- Located in `kg_microbe/transform_utils/[source_name]/`
- Class name: `[SourceName]Transform` (e.g., `BacDiveTransform`, `MediaDiveTransform`)
- Registered in the `DATA_SOURCES` dict in `kg_microbe/transform.py`
- Implement the `run()` method from the base `Transform` class
- Output standard KGX TSV format (`nodes.tsv`, `edges.tsv`)
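A transform following this pattern might look like the sketch below. The `Transform` stand-in and its constructor are illustrative assumptions (the real base class in `kg_microbe/transform_utils/transform.py` has a richer interface); only the inheritance, the `run()` method, and the `nodes.tsv`/`edges.tsv` outputs come from the pattern above.

```python
import csv
from pathlib import Path


class Transform:
    """Minimal stand-in for the project's base class (illustrative only)."""

    def __init__(self, output_dir: str) -> None:
        self.output_dir = Path(output_dir)

    def run(self) -> None:
        raise NotImplementedError


class ExampleSourceTransform(Transform):
    """Hypothetical transform for an 'example_source' dataset."""

    def run(self) -> None:
        # Every transform writes KGX-style nodes.tsv and edges.tsv.
        self.output_dir.mkdir(parents=True, exist_ok=True)
        with open(self.output_dir / "nodes.tsv", "w", newline="") as fh:
            writer = csv.writer(fh, delimiter="\t")
            writer.writerow(["id", "category", "name"])
            writer.writerow(["NCBITaxon:562", "biolink:OrganismTaxon", "Escherichia coli"])
        with open(self.output_dir / "edges.tsv", "w", newline="") as fh:
            writer = csv.writer(fh, delimiter="\t")
            writer.writerow(["subject", "predicate", "object"])
```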
Key transform sources (currently active in `DATA_SOURCES`):
- `bacdive`: Bacterial diversity data (taxon traits, growth media, metabolic properties)
- `mediadive`: Growth media composition data
- `madin_etal`: Condensed bacterial/archaeal traits from literature
- `bactotraits`: Bacterial trait data
- `ontologies`: OBO ontologies (ENVO, ChEBI, GO, NCBITaxon, MONDO, HP, EC)
- `rhea_mappings`: Rhea reaction mappings to GO and EC
Additional available transforms (commented out in `DATA_SOURCES`):
- `uniprot_functional_microbes`: Protein data for functional microbes
- `ctd`: Comparative Toxicogenomics Database
- `disbiome`: Microbiome-disease associations
- `wallen_etal`: Additional bacterial trait data
- `uniprot_human`: Human protein data
- `kg_microbe/run.py`: CLI entry point with Click commands
- `kg_microbe/transform_utils/constants.py`: Standard column names (`ID_COLUMN`, `CATEGORY_COLUMN`, etc.)
- `kg_microbe/transform_utils/custom_curies.yaml`: Custom CURIE prefix mappings
- `kg_microbe/transform_utils/translation_table.yaml`: Entity type translations
- `pyproject.toml`: Poetry configuration, ruff/black settings
```text
download.yaml → data/raw/[source].json/csv/owl
        ↓
Transform Classes
        ↓
data/transformed/[source]/nodes.tsv
data/transformed/[source]/edges.tsv
        ↓
merge.yaml
        ↓
data/merged/merged-kg.tar.gz
```
The KG construction process is computationally intensive, particularly:
- Trimming NCBI Taxonomy
- Processing microbial UniProt datasets (for KG-Microbe-Function and KG-Microbe-Biomedical-Function)
Successful execution may require significant memory resources (e.g., >500 GB RAM for certain operations).
MetaTraits transforms (both `metatraits` and `metatraits_gtdb`) now support parallel processing to significantly reduce runtime:
Performance:
- Sequential mode: 5-8 hours for GTDB metatraits (85K taxa)
- Parallel mode: 1.5-2.5 hours (2-3x speedup)
Configuration:
- Auto-enabled: Multiprocessing is ON by default when either (a) 2+ input files exist, or (b) a single large input file is internally chunk-split
- Auto-scaled: Worker count automatically adjusted based on CPU cores and available memory (3GB per worker)
- Environment variables:

  ```bash
  # Disable multiprocessing
  METATRAITS_MULTIPROCESSING=false poetry run kg transform -s metatraits

  # Override worker count
  METATRAITS_WORKERS=4 poetry run kg transform -s metatraits
  ```
Resource requirements:
- Each worker needs ~3GB RAM (for OAK adapter + processing)
- Uses N-1 CPU cores by default (leaves 1 for system)
- Example: 8-core system with 24GB RAM → 4 parallel workers
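The auto-scaling rule can be sketched as the two caps described above (N−1 CPU cores, ~3 GB per worker) plus the documented environment-variable overrides. This is a simplified illustration, not the project's actual implementation; in particular, the 8-core/24 GB → 4 workers example above suggests the real code reserves additional headroom beyond these two caps.

```python
import os

GB = 1024**3
WORKER_MEMORY = 3 * GB  # ~3 GB per worker (OAK adapter + processing)


def auto_worker_count(cpu_cores: int, available_memory: int) -> int:
    """Cap workers by CPU (leave one core for the system) and by memory."""
    by_cpu = max(1, cpu_cores - 1)
    by_memory = max(1, available_memory // WORKER_MEMORY)
    return min(by_cpu, by_memory)


def configured_workers(cpu_cores: int, available_memory: int) -> int:
    """Apply the documented METATRAITS_* environment-variable overrides."""
    if os.getenv("METATRAITS_MULTIPROCESSING", "true").lower() == "false":
        return 1  # multiprocessing disabled: run sequentially
    override = os.getenv("METATRAITS_WORKERS")
    if override is not None:
        return int(override)
    return auto_worker_count(cpu_cores, available_memory)
```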
ALWAYS run `poetry run tox` before every commit to ensure code quality. This runs all quality checks: format, lint, codespell, docstr-coverage, and tests.
Copy `.env.example` to `.env` and configure:
- `BACDIVE_USERNAME`: BacDive API email
- `BACDIVE_PASSWORD`: BacDive API password
- Transform classes: `[SourceName]Transform` in `transform_utils/[source_name]/[source_name].py`
- Source name constants: uppercase in `transform_utils/constants.py` (e.g., `BACDIVE = "bacdive"`)
- Output files: always `nodes.tsv` and `edges.tsv` per source
- Column names: use constants from `constants.py` (e.g., `SUBJECT_COLUMN`, `PREDICATE_COLUMN`, `OBJECT_COLUMN`)
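Sketched as module-level constants, these conventions might look like the following. Only `BACDIVE = "bacdive"` is stated above; the column-name string values are assumptions implied by the edge format (the real definitions live in `transform_utils/constants.py`).

```python
# Source name constants: uppercase names, lowercase string values.
BACDIVE = "bacdive"
MEDIADIVE = "mediadive"

# Column-name constants used when building edge headers (assumed values).
SUBJECT_COLUMN = "subject"
PREDICATE_COLUMN = "predicate"
OBJECT_COLUMN = "object"

EDGE_HEADER = [SUBJECT_COLUMN, PREDICATE_COLUMN, OBJECT_COLUMN]
```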
- Line length: 120 characters (ruff), 100 characters (black)
- Python: ≥3.10
- Linting: ruff with pydocstyle (D), pycodestyle (E), Pyflakes (F), isort (I), flake8-bandit (S)
- Type hints required for function signatures
- Docstrings required (checked by `docstr-coverage`)
Tests live in the `tests/` directory:
- `test_transform_class.py`: Transform class tests
- `test_transform_utils.py`: Transform utility tests
- `test_run.py`: CLI command tests
- `test_query.py`: SPARQL query tests
Test resources live in `tests/resources/`.
1. Create the directory `kg_microbe/transform_utils/[new_source]/`
2. Create a transform class inheriting from `Transform`
3. Implement the `run()` method to generate `nodes.tsv` and `edges.tsv`
4. Add a constant to `constants.py`: `NEW_SOURCE = "new_source"`
5. Register it in the `DATA_SOURCES` dict in `kg_microbe/transform.py`
6. Add a download entry to `download.yaml`
7. Add a merge entry to `merge.yaml`
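The constant, class, and `DATA_SOURCES` registration described above fit together as in the sketch below. The class body and dict contents are illustrative stand-ins; in the real code the constant lives in `constants.py`, the class in its own `transform_utils` subpackage, and the dict in `kg_microbe/transform.py`.

```python
NEW_SOURCE = "new_source"  # constant as added to constants.py


class Transform:
    """Stand-in for the base class in transform_utils/transform.py."""


class NewSourceTransform(Transform):
    """Transform class for the new source (illustrative)."""

    def run(self) -> None:
        ...  # write nodes.tsv and edges.tsv here


# Registration in the DATA_SOURCES dict, alongside existing entries
# such as "bacdive" and "mediadive".
DATA_SOURCES = {
    NEW_SOURCE: NewSourceTransform,
}
```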
Edges must include:
- `subject`: Subject node ID (with CURIE prefix)
- `predicate`: Biolink predicate (e.g., `biolink:related_to`)
- `object`: Object node ID (with CURIE prefix)
- `relation`: RO or other relation ontology term
- `primary_knowledge_source`: Source provenance
Nodes must include:
- `id`: Unique CURIE identifier
- `category`: Biolink category (e.g., `biolink:OrganismTaxon`, `biolink:ChemicalEntity`)
- `name`: Human-readable label
- Other optional fields: `description`, `xref`, `synonym`, `provided_by`
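A minimal sketch of emitting an edge row with the required columns above. The subject and object CURIEs are real identifiers chosen for illustration (E. coli, glucose); the RO term is a placeholder, not a recommendation for this edge.

```python
import csv
import io

EDGE_COLUMNS = ["subject", "predicate", "object", "relation", "primary_knowledge_source"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=EDGE_COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerow(
    {
        "subject": "NCBITaxon:562",       # Escherichia coli
        "predicate": "biolink:related_to",
        "object": "CHEBI:17234",          # glucose
        "relation": "RO:0002162",         # placeholder relation ontology term
        "primary_knowledge_source": "bacdive",
    }
)
edge_tsv = buf.getvalue()
```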