Serotype O111 belongs to the “Big Six” Escherichia coli taxa causing food poisoning in the US and beyond. In this tutorial we analyze the subtype with the H8 flagellar antigen in the context of all E. coli genomes with a view to designing serotype-specific diagnostic markers. This analysis is part of a forthcoming manuscript entitled Sorting Bacterial Genomes Below the Species Rank.
Beatriz Vieira Mourato, Sara-Lena Welk, Fabian Klötzl, and Bernhard Haubold
This tutorial has six main dependencies:
- Neighbors for finding target and neighbor genomes
- `datasets` for downloading genomes
- Fur for finding unique genome regions
- Prim for designing PCR primers
- Biobox for general sequence manipulation
- the Unix tools `curl`, `bzip2`, and `zip` for downloading and decompressing files
If you are on a Debian system like WSL/Ubuntu, you can install these dependencies into `~/bin/` by executing the following from inside the `o111h8` repo:

    bash scripts/setup.sh

Once that’s done, make sure `~/bin/` is in your path by running

    source ~/.profile

We have tested this setup on our “minimal box” Docker container, mix.
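
After the setup, it can be useful to confirm that the tools are actually reachable. The following is a minimal sketch, not part of the repo: the helper names `missing_tools` and `in_path` are hypothetical, and only the commands named explicitly in the list above (`datasets`, `curl`, `bzip2`, `zip`) are checked, since the executable names of the other packages are not stated here. It also assumes that `~/bin` appears in `PATH` without a trailing slash.

```shell
# Hypothetical helper: print each named tool that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical helper: succeed if the given directory is an entry of PATH.
# Assumes the entry is written exactly as given (no trailing slash).
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check only the commands that the dependency list names explicitly.
missing_tools datasets curl bzip2 zip
in_path "$HOME/bin" || echo "$HOME/bin is not in PATH"
```

If nothing is printed, the four commands were found and `~/bin` is in the path. Further executables can be passed to `missing_tools` as additional arguments.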
Execute

    make data

to generate the directory `data` and calculate the Neighbors database inside it. This takes approximately 2.5 minutes, produces 166 warnings you can safely ignore, and results in four relevant files inside `data`:
- `neidb`, the Neighbors database calculated from the GenBank assemblies, RefSeq assemblies, and the taxonomy database downloaded from the NCBI on 17th June 2026
- `eco.json`, the genome summaries of the 7884 E. coli genomes assembled to level “complete”
- `eco.nwk`, the tree of the 7386 complete E. coli genomes that passed the quality filter
- `sero.txt`, the serotypes of the 7386 complete E. coli genomes calculated with `ectyper`
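
A quick sanity check of these four files might look like the sketch below. The helper names `check_data` and `count_leaves` are hypothetical and not part of the repo. The leaf count relies on two assumptions: that `eco.nwk` holds a single tree and that no leaf label contains a comma, in which case the number of leaves is the number of commas plus one.

```shell
# Hypothetical helper: print each expected file that is missing or empty
# in the directory given as first argument.
check_data() {
    dir=$1
    for f in neidb eco.json eco.nwk sero.txt; do
        [ -s "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# Hypothetical helper: estimate the number of leaves in a Newick file.
# Assumes a single tree whose labels contain no commas.
count_leaves() {
    commas=$(tr -cd ',' < "$1" | wc -c)
    echo $((commas + 1))
}

# Run from the directory that contains `data`.
check_data data
if [ -s data/eco.nwk ]; then
    count_leaves data/eco.nwk
fi
```

If the assumptions hold, `check_data` prints nothing and the leaf count should agree with the 7386 genomes that passed the quality filter.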
Make the tutorial and change into it.

    make tutorial

The resulting directory now contains the scripts and data files for following the tutorial described in the doc.