Every node in the NCBI taxonomy tree has an associated collection of sequences: those assigned to the node itself (if any) together with those of all its descendants. For each node, these sequences are "clustered", provided the number of sequences is not too large (presently a limit of 20,000 sequences). The purpose of the clustering pipeline is to assemble sets of sequences that share at least local homologies (i.e., matching or nearly matching subsequences). These sets can form the basis for the construction of individual phylogenetic data sets. Subsequently, clusters can be aligned and then combined using supermatrix, supertree, or other approaches.

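As a rough illustration of how each node's collection could be assembled, the sketch below gathers sequence identifiers bottom-up over a toy taxonomy. The function names, the dictionary-based tree representation, and the choice to treat the 20,000 limit as inclusive are assumptions made for this example; they are not taken from the pipeline itself.

```python
MAX_SEQUENCES = 20_000  # size limit quoted in the text; treated as inclusive here (assumption)


def collect_subtree_sequences(children, own_sequences, root):
    """Return {node: sequence ids of the node itself plus all its descendants}.

    children      -- hypothetical mapping: node -> list of child nodes
    own_sequences -- hypothetical mapping: node -> sequence ids assigned directly to it
    """
    collected = {}
    # Iterative post-order traversal, so deep taxonomies do not hit the recursion limit.
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if not children_done:
            stack.append((node, True))
            for child in children.get(node, []):
                stack.append((child, False))
        else:
            seqs = list(own_sequences.get(node, []))
            for child in children.get(node, []):
                seqs.extend(collected[child])
            collected[node] = seqs
    return collected


def nodes_to_cluster(collected, limit=MAX_SEQUENCES):
    """Nodes that have at least one sequence and do not exceed the size limit."""
    return [node for node, seqs in collected.items() if 0 < len(seqs) <= limit]
```

Storing a full list per node is only practical for small examples; on a tree the size of the NCBI taxonomy one would more likely keep counts or database queries per node.
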
We use BLAST at the core of this pipeline to identify all local homologies ("hits") between every pair of
sequences. Next, the hits can be filtered in a variety of ways. For example, for phylogenetic purposes it is
ideal to have sequences of nearly the same length so that alignment programs work well and there is little
missing data. Thus, the list of hits can be filtered to exclude small regions of local homology. Once a final
list of hits is obtained, a set of clusters of sequences is built. Presently, we filter
by keeping only hits that have greater than 51% coverage in both directions (at a stringent BLAST e-value cutoff
of 10^-10).
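The filtering and cluster-building steps can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Hit` record and its field names are hypothetical, "coverage in both directions" is assumed to mean the aligned fraction of each of the two sequences, and the e-value cutoff is taken to be 1e-10. Sequences connected by any chain of surviving hits end up in the same cluster (single linkage), implemented here with a union-find structure.

```python
from dataclasses import dataclass

MIN_COVERAGE = 0.51  # hits must cover more than 51% of each sequence
MAX_EVALUE = 1e-10   # assumed reading of the e-value cutoff


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one pairwise BLAST hit."""
    query: str
    subject: str
    query_len: int        # full length of the query sequence
    subject_len: int      # full length of the subject sequence
    query_aligned: int    # aligned positions on the query
    subject_aligned: int  # aligned positions on the subject
    evalue: float


def passes_filter(hit, min_coverage=MIN_COVERAGE, max_evalue=MAX_EVALUE):
    """Keep a hit only if it is significant and covers enough of both sequences."""
    if hit.evalue > max_evalue:
        return False
    query_cov = hit.query_aligned / hit.query_len
    subject_cov = hit.subject_aligned / hit.subject_len
    return query_cov > min_coverage and subject_cov > min_coverage


def single_linkage(seq_ids, hits):
    """Group sequences so that any hit between two sequences joins their clusters."""
    parent = {s: s for s in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for hit in hits:
        a, b = find(hit.query), find(hit.subject)
        if a != b:
            parent[b] = a

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())
```

A typical call would be `single_linkage(seq_ids, [h for h in hits if passes_filter(h)])`. Sequences with no surviving hit come out as singleton clusters, which matches the statement that a cluster contains one or more sequences.
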
Then the filtered hit list is turned into a set of clusters via "single-linkage clustering". To be a member of such a cluster, a sequence merely has to have a hit with any other member of the cluster. Even at this stage, stricter, smaller clusters can be obtained by other clustering methods, such as complete-linkage clustering. Clustering algorithms are an active area of research and many alternatives are available. The output of the pipeline is thus a set of clusters, each of which contains one or more sequences.

Caveats

Treatment of model organisms

Model organisms are defined as any node in the NCBI tree that either has >20,000 sequences (in which case no clusters are constructed for that node) or has <20,000 sequences but yields >100 clusters under the clustering procedure outlined above. Thus, organisms that have many sequences from one locus (e.g., from population genetic studies) will have only a small number of clusters and will not be considered models. At present several hundred model organisms are recognized according to these criteria; most have <20,000 sequences but >100 clusters.

Model organisms receive special handling in the construction of the Phylota Browser database. The user can select whether or not sequence tallies for higher taxa include sequences from model organisms within their group. More fundamentally, construction of the higher-taxon clusters treats model organisms differently. Since each tends to contain a large to very large number of sequences, most of which are phylogenetic singletons, it is computationally expensive and somewhat wasteful to include them in all-against-all BLAST searches at each higher taxonomic level. Instead, we initially exclude them from clustering, build all clusters in the database without them, and then use BLAST to find sequences in the model organisms that are homologous to these already-constructed clusters.

This can be done quite efficiently, at the cost of considerable dependence on how well clusters are represented in the phylogenetic neighborhood of the model organisms. In other words, this procedure does not add new clusters to the database; it merely adds sequences to those clusters that would have been present anyway in the relatives of the model organisms. Nonetheless, in practice it seems to convert many phylogenetically uninformative clusters to informative clusters around the model organisms and thereby increase the density of the data availability matrices for informative data in these regions of the tree. See list of model organisms in the current release here.
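
The two model-organism criteria, and the step of adding model-organism sequences to clusters that already exist, can be illustrated with a short sketch. The function names and data shapes are hypothetical, and a node with exactly 20,000 sequences is treated as below the limit, which the text does not specify.

```python
MAX_SEQUENCES = 20_000  # above this, a node is a model organism and is not clustered
MAX_CLUSTERS = 100      # above this, a clustered node is also a model organism


def is_model_organism(n_sequences, n_clusters=None):
    """Apply the two criteria; n_clusters is None when the node was never clustered."""
    if n_sequences > MAX_SEQUENCES:
        return True
    return n_clusters is not None and n_clusters > MAX_CLUSTERS


def add_model_sequences(clusters, model_hits):
    """Add model-organism sequences to existing clusters without creating new ones.

    clusters   -- {cluster_id: set of sequence ids}, built without model organisms
    model_hits -- iterable of (model_seq_id, matched_seq_id) pairs, e.g. from BLAST
    """
    seq_to_cluster = {s: cid for cid, members in clusters.items() for s in members}
    augmented = {cid: set(members) for cid, members in clusters.items()}
    for model_seq, matched_seq in model_hits:
        cid = seq_to_cluster.get(matched_seq)
        if cid is not None:  # hits to sequences outside every cluster are ignored
            augmented[cid].add(model_seq)
    return augmented
```

In this sketch a model-organism sequence that matches members of two different clusters is added to both; how the actual pipeline resolves that case is not stated in the text. The number of clusters never changes, which reflects the point that the procedure only enlarges clusters already present among the relatives of the model organisms.
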