Defining bacterial speciesBacterial species status (or any species classification) can be a tricky thing to pin down. The kind of definition that is easy to use and intuitively familiar to most people for animals and plants (essentially, if the members of two populations have sex will there be offspring? If so, they're the same species) is a reproductive definition, but bacteria don't really have sex and reproduce in a way that is familiar to most people. As a result, we need a different kind of definition if we're going to talk about bacterial species at all.
|Rua do Instituto Bacteriologico, Lisbon|
Happily, there are many other suggestions for how we might define species, though they don't all necessarily agree with each other. Since genetic material, i.e. DNA, is typically more similar between related individuals - those that share a more recent common ancestor - we can use DNA similarity as a measure of 'phylogenetic species'. DNA-DNA hybridisation (DDH) has been used for about half a century as a practical way to measure this similarity. But this is a physical experiment, and only an approximate measure, whereas new DNA sequencing technologies enable us to read the entire genomes of bacteria quickly, and therefore calculate the similarity of two individuals' DNA with much greater accuracy, and often quite cheaply, too. So it is now practical to use computational comparisons of bacterial genome sequences to determine whether they are the same species.
Over the years, some rules of thumb have evolved for the amount of genetic similarity that implies that two bacteria belong to the same species. Though, to be honest, I much prefer the pan-genome concept as an organising principle. For DNA-DNA hybridisation this is often quoted as "70% DNA similarity" (e.g. here, and here). But if we're using bacterial genome sequences, we can use a precise level of DNA similarity, the Average Nucleotide Identity (ANI). A couple of years ago, it was determined that the overall DNA sequence similarity between two genome sequences at the boundaries between previously accepted species is about 95-96%.
But there's more than one way to measure sequence similarity. As usual.
Defining ANIIn their paper "DNA-DNA hybridization values and their relationship to whole-genome sequence similarities." (doi:10.1099/ijs.0.64483-0), Goris et al. introduced ANI as a substitute for DDH when determining prokaryotic species boundaries. But they used a method of determining ANI that struck me as odd. In their implementation, they slice one of the genomes (the query genome) into consecutive 1020nt long fragments. This was to keep some kind of - to me, unnecessary - accordance with the practical application of DDH, which also fragments the query genome into similar lengths (but not in the same way!). Since their method discards fragments with less than 30% sequence identity over 70% of their length, this makes it possible that (small) sections of the genome that match exactly may not contribute to the ANI score. Also, repeat regions in the query genome may count twice towards this estimate. Both of these concerns mean that we're not really measuring the overall level of similarity, therefore a measure of sequence similarity that counts all uniquely matching regions between two genome sequences would seem a little more reasonable.
|Figure 1 from Goris et al. describing the observed relationship between DDH and ANIb|
The 2009 paper "Shifting the genomic gold standard for the prokaryotic species definition" (doi:10.1073/pnas.0906412106) by Richter and Rosselló-Móra adapts the ANI method using the MUMmer alignment software to do exactly that. They call this version of the algorithm ANIm, distinguishing it from the original Goris et al. method, which they call ANIb (so named because it uses the BLAST package for alignment). They concluded that the two methods were broadly equivalent, especially when sequence similarity was very high. However, ANIm was more stringent in the selection of genome sequence regions for comparison, leading to some differences in classification.
|Figure 1 from Richter and Rosselló-Móra (2009), showing agreement between ANIb and ANIm methods|
|Figure 2 from Richter and Rosselló-Móra (2009) showing concordance between DDH and ANIm|
Calculating ANIRichter and Rosselló-Móra provide the JSpecies software, which is written in Java, GUI-based, and provides some summary graphics, in addition to ANIb, TETRA, and ANIm output. But I wanted something I could bung into a pipeline, and which would produce the kind of heatmap output that I wanted. So I wrote a script (calculate_ani.py) in Python, which I've made available at GitHub.
To use the script calculate_ani.py, you pass it a directory containing the sequences you want to compare (one FASTA file per isolate), and it returns a plain text tab-separated table of the ANIb/ANIm/TETRA scores - as requested. Optionally, the script will also return a heatmap and dendrogram of the results. It requires Biopython, and local installations of BLAST+ (for ANIb) and MUMmer (for ANIm); for graphical output R and Rpy2 are also required. It uses Python's multiprocessing module to parallelise the BLAST and/or MUMmer searches for input organisms, and speed things up. Output, including intermediate files (for reproducibility), are written to a single output directory. The verbose output is intended to be verbose enough to be useful for reproducible research, and a logfile for output can be specified on the command-line.
For demonstration purposes, we can download the current set of Chlorobium genomes. Setting up a filelist in chlorobium.txt:
These correspond to C.chlorochromatii (NC_007514), C.limicola (NC_010803), C.luteolum (NC_007512), C.phaeobacteroides (NC_010831 and NC_008639), C.phaeovibrioides (NC_009337), and C.tepidum (NC_002932).we can download the set of genomes for this genus with wget:
and then apply the calculate_ani.py script:
On my laptop, for this dataset, ANIb takes about 26s, while ANIm and TETRA each take around 36s of userspace time. Since we've got two members of C.phaeobacteroides, we'd expect these to have ANIb and ANIm scores over 0.95, and a high TETRA score. What do we get?
|ANIb results for Chlorobium isolates: scores below 0.95 suggest that all are distinct species.|
|TETRA results for Chlorobium isolates: the isolates recorded as C.phaeobacteroides and C.limicola show the strongest TETRA scores, but without ANIb/ANIm support, we can't say they're the same species.|
|ANIm results for Chlorobium isolates: scores below 0.95 suggest that all are distinct species.|
Oh dear. Not only do the scores seem to indicate that none of these isolates belong to the same species, we don't even find that the two C.phaeobacteroides isolates group together by anything other than TETRA. I know nothing about Chlorobium phylogenetics, but I'm not hugely surprised, as misnaming of bacterial species in GenBank is known elsewhere (e.g. three of four GenBank records in doi:10.1111/j.1365-3059.2012.02678.x).
It would be nice to have an example that confirms the species relationships assigned in GenBank, so let's try Anaplasma (again, I know nothing about these organisms). Here's the file list:
so there are three A.marginale (NC_022760, NC_012026, NC_004842), and four A.phagocytophilum (NC_021881, NC_021897, NC_007797, NC_021880), with one A.centrale (NC_013532). So we'd expect three groupings, by species.
This looks a bit better than before:
|ANIb results for Anaplasma isolates. Despite the A.phagocytophilum/A.marginale+A.centrale split, the scores seem to support A.marginale as a species, but not much else.|
|ANIm results for Anaplasma isolates. These results support the species classifications as recorded in GenBank.|
|TETRA results for Anaplasma isolates. Again, the A.marginale+A.centrale/A.phagocytophilum split is supported, but no separation between A.marginale and A.centrale is seen.|