Identification - What defines a species?
Defining a “species” can be challenging, even for macro-organisms. The boundary between ecological variation and speciation is often vague, and nowhere more so than among prokaryotes. Closely related species, which are often identified by comparing small subunit ribosomal RNA (16S rRNA) sequences, may have radically different metabolic capabilities, and horizontal gene transfer lessens the reliability of molecular characters in resolving evolutionary relationships. The technical definition of a species used today is 70% similarity between genomes, as measured by DNA-DNA reannealing methods. Sequence similarity among small subunit ribosomal genes has for many years been another standard for defining a species: research has indicated that 97% sequence identity between 16S rRNA genes often corresponds to 70% genome similarity. Other techniques include multilocus sequence typing (MLST), in which several genes judged to be suitable taxonomic markers are used to generate a single phylogeny for a set of species. Each gene is aligned separately, and the alignments are then concatenated into a single data set with multiple partitions. MLST is a natural lead-in to whole-genome-based phylogeny, in which gene sequences as well as chromosome structural features are incorporated into a single analysis. Development of genome-based phylogenetic techniques may lead to a better understanding of the interplay between vertical inheritance, horizontal gene transfer, and environmental factors in prokaryotic speciation.
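As a rough illustration of the 97% rule of thumb for 16S rRNA, the sketch below computes percent identity between two sequences that are assumed to be already aligned to the same length. The function names, the choice to skip gapped columns, and the toy sequences are assumptions made for this example, not part of any standard tool.

```python
SPECIES_THRESHOLD = 97.0  # percent identity for 16S rRNA, as quoted in the text


def percent_identity(seq_a, seq_b):
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumes both sequences come from the same alignment (equal length,
    gaps written as '-'). Skipping gapped columns is one of several
    possible conventions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def same_species_candidate(seq_a, seq_b, threshold=SPECIES_THRESHOLD):
    """True if the pair meets the identity threshold (a heuristic only)."""
    return percent_identity(seq_a, seq_b) >= threshold


if __name__ == "__main__":
    # Toy sequences, far shorter than a real 16S gene (about 1500 bases).
    reference = "ACGTACGTAC"
    isolate = "ACGTACGTAA"
    print(percent_identity(reference, isolate))
    print(same_species_candidate(reference, isolate))
```

A pair that passes this threshold is only a candidate for the same species; as noted above, organisms with nearly identical 16S sequences can still differ widely in metabolism.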
What makes for a suitable taxonomic marker?
A suitable taxonomic marker is usually a gene whose product plays a critical role in cell functioning, that is, a “housekeeping” gene in which large changes in sequence are often not tolerated. As such, these genes are highly conserved across a broad range of organisms. Different genes may be better or worse for resolving evolutionary relationships at different taxonomic levels. For example, while the 16S rRNA gene is usually suitable for determining groups at the genus level, the internal transcribed spacer (ITS) region is often used to distinguish between isolates within a genus, due to its higher rate of nucleotide substitution over generations. Such differences in substitution rate become a problem when several genes are combined in one analysis. This can be avoided to some degree by using maximum likelihood based phylogeny methods, as nucleotide substitution rates can be estimated for each gene partition in a concatenated alignment. However, this can only account for a certain level of variation in phylogenetic signal among the different genes, and it is still up to the researcher to choose genes that code for conserved functions across all organisms. One detail that is sometimes overlooked is the quality of the alignments. While computer alignment algorithms align sequences in a mathematical sense, incorporating biological information about the molecule, such as protein domain alignments or secondary structure, often leads to more accurate phylogenetic inference.
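The concatenated, partitioned alignment mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: each gene has already been aligned separately, a taxon missing from a gene is padded with gap characters, and partitions are recorded as 1-based inclusive column ranges. The function name and the gene and taxon labels are hypothetical.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments into one supermatrix with partition ranges.

    gene_alignments maps gene name -> {taxon: aligned sequence}.
    Returns (supermatrix, partitions), where supermatrix maps each taxon
    to its concatenated sequence and partitions is a list of
    (gene, first_column, last_column) tuples, 1-based and inclusive.
    """
    # Every taxon seen in any gene gets a row in the supermatrix.
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to one length")
        length = lengths.pop()
        for t in taxa:
            # Assumption: a taxon lacking this gene is padded with gaps.
            pieces[t].append(aln.get(t, "-" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {t: "".join(parts) for t, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    genes = {
        "geneA": {"taxon1": "ACGT", "taxon2": "ACGA"},
        "geneB": {"taxon1": "GG", "taxon3": "GC"},
    }
    matrix, parts = concatenate_alignments(genes)
    for taxon, seq in matrix.items():
        print(taxon, seq)
    for gene, first, last in parts:
        print(f"{gene} = {first}-{last}")
```

Keeping the column ranges alongside the supermatrix is what allows a likelihood program to estimate a separate substitution rate for each gene rather than one rate for the whole data set.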
Phylogenetic analysis - inferring evolutionary relationships
Finally, the method of phylogenetic construction should also be considered, as different methods may produce conflicting results. Methods based on pairwise genetic distance matrices are often very fast, but do not fully utilize the information content within a sequence alignment. Parsimony methods tend to be more accurate, as all characters within an alignment are analyzed, but they suffer from ‘long-branch attraction’, a tendency to group distantly related taxa together when sequences are highly divergent. Maximum likelihood methods are often described as the most accurate, and have the added benefit of being statistically based, so hypothesis testing can be used to evaluate evolutionary relationships. However, the strength of likelihood methods is also their weakness, as the analysis is model based. Incorrect evolutionary models or over-parameterization may lead to inaccurate inference, and analysis of many taxa is computationally demanding, leading to prohibitive time requirements. Programs like PhyML, or Bayesian analysis programs (BEAST, MrBayes), help reduce this time requirement.
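To make the point about distance methods concrete, the sketch below reduces an alignment to a pairwise distance matrix. It uses the proportion of differing sites (p-distance) with the Jukes-Cantor (JC69) correction, which is one simple choice among many substitution models. The function names and the toy alignment are assumptions for illustration; tree building from the matrix (for example by neighbor joining) is not shown.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            diffs += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return diffs / compared


def jc69_distance(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3).

    The correction is undefined once p reaches 0.75 (saturation),
    so infinity is returned in that case.
    """
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Return (taxa, matrix) with matrix[a][b] the JC69 distance."""
    taxa = sorted(alignment)
    matrix = {a: {b: 0.0 for b in taxa} for a in taxa}
    for i, a in enumerate(taxa):
        for b in taxa[i + 1:]:
            d = jc69_distance(p_distance(alignment[a], alignment[b]))
            matrix[a][b] = d
            matrix[b][a] = d  # distances are symmetric
    return taxa, matrix


if __name__ == "__main__":
    toy = {"taxon1": "ACGTACGT", "taxon2": "ACGTACGA", "taxon3": "ACGAACTA"}
    taxa, matrix = distance_matrix(toy)
    for a in taxa:
        print(a, [round(matrix[a][b], 3) for b in taxa])
```

Each pair of sequences is summarized by a single number, so which sites differ and how they differ is discarded. That is the loss of information referred to above, and it is also why these methods are fast.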