This database provides a snapshot of the current taxonomic distribution of nucleotide sequences in GenBank.
Its purpose is to convey information about the potential phylogenetic data sets (clusters, or sets of homologous sequences) that can be constructed from the database for taxa of interest. It mirrors the NCBI taxonomy tree.
The number of clusters is estimated by all-against-all BLAST searches followed by sequence clustering. Clustering is performed for all nodes with fewer than 20,000 sequences, and sequences longer than 25,000 nt are excluded.
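The two size limits above can be read as a pre-filter applied before clustering. The following is a minimal sketch, assuming a hypothetical per-node mapping from sequence identifier to sequence length. The function name, the data layout, and the order in which the two limits are applied are illustrative assumptions, not the database's actual pipeline:

```python
# Limits quoted in the text; everything else here is hypothetical.
MAX_NODE_SEQUENCES = 20_000   # nodes must have fewer sequences than this
MAX_SEQUENCE_LENGTH = 25_000  # sequences longer than this (in nt) are excluded


def sequences_to_cluster(node_sequences):
    """Return the sequence ids of one node that would enter clustering.

    node_sequences: dict mapping sequence id -> length in nucleotides.
    Assumption: the node-size limit is checked on the unfiltered count.
    """
    if len(node_sequences) >= MAX_NODE_SEQUENCES:
        return []  # node is too large to be clustered on its own
    return [
        seq_id
        for seq_id, length in node_sequences.items()
        if length <= MAX_SEQUENCE_LENGTH
    ]
```

The surviving sequences would then be passed to the all-against-all BLAST step, which is not shown here.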
Model organisms are defined as any node (not subtree) having more than 100 clusters or more than 20,000 sequences. By default, sequence tallies for model organisms propagate upward in the tree along with those for nonmodel organisms, but this information can be excluded, so that users can get a sense of the taxonomic breadth of the sequence diversity in the database. Note, however, that the bulk of "genomic" data for model organisms is not entered in the database at all (see below for types of sequences included).
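The model-organism rule and the optional exclusion of model organisms from the upward tallies can be sketched as follows. This is an illustration only: the `Node` class, its field names, and both function names are hypothetical, and only the two thresholds come from the text.

```python
from dataclasses import dataclass, field
from typing import List

# Thresholds quoted in the text.
MODEL_CLUSTER_THRESHOLD = 100      # more than 100 clusters
MODEL_SEQUENCE_THRESHOLD = 20_000  # more than 20,000 sequences


@dataclass
class Node:
    """Hypothetical taxonomy node; counts refer to the node itself, not its subtree."""
    name: str
    n_sequences: int = 0
    n_clusters: int = 0
    children: List["Node"] = field(default_factory=list)


def is_model(node):
    """A node (not a subtree) is a model organism if it exceeds either threshold."""
    return (node.n_clusters > MODEL_CLUSTER_THRESHOLD
            or node.n_sequences > MODEL_SEQUENCE_THRESHOLD)


def subtree_tally(node, include_models=True):
    """Sum sequence counts over a subtree, optionally skipping model organisms."""
    own = node.n_sequences if include_models or not is_model(node) else 0
    return own + sum(subtree_tally(child, include_models) for child in node.children)
```

With a genus holding 5 sequences directly, one model species with 30,000 sequences, and one other species with 12, the default tally is 30,017, while `include_models=False` gives 17, which better reflects how evenly the data are spread across taxa.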
Cluster tallies are linked to a view of the data availability matrix for that node in the taxonomy tree, which can provide useful guidance for supermatrix and supertree construction. Sequences for each cluster can be downloaded as an unaligned FASTA file for further analysis. Provisional alignments and phylogenetic trees are under construction.
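A data availability matrix of the kind mentioned above records, for each taxon and each cluster, whether the taxon has at least one sequence in that cluster. The sketch below assumes a hypothetical input (a mapping from cluster identifier to the taxon IDs it contains). It does not describe how the database stores or renders its own matrix.

```python
def availability_matrix(clusters):
    """Build a taxon-by-cluster presence/absence table.

    clusters: dict mapping cluster id -> iterable of taxon ids.
    Returns (cluster_ids, rows), where rows maps each taxon id to a list
    of 0/1 flags in the same order as cluster_ids.
    """
    cluster_ids = sorted(clusters)
    members = {cid: set(clusters[cid]) for cid in cluster_ids}
    taxa = sorted(set().union(*members.values())) if members else []
    rows = {
        taxon: [1 if taxon in members[cid] else 0 for cid in cluster_ids]
        for taxon in taxa
    }
    return cluster_ids, rows
```

Rows with many 1s mark well-sampled taxa and columns with many 1s mark widely sampled loci, which is the kind of overlap a supermatrix or supertree analysis depends on.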
For more information on how the clustering was implemented, click here.
For a list of model organisms, click here.
Please note: The next rebuild of the database is scheduled for Summer 2008, at which time we hope to have automated bi-monthly GenBank downloads implemented.
Types of sequences included: Only "core" nucleotide data are included, which excludes ESTs, STSs, and other kinds of bulk or high-throughput sequences.
Taxonomic coverage: At present the database contains sequences from eukaryotes. These represent the PLN, MAM, PRI, ROD, VRT, and INV divisions of GenBank.
GenBank release:159 (April 15, 2007)
Number of sequences in this database:2593190
Number of nodes in our subtree(s) of the NCBI taxonomy tree:240708
Number of terminal nodes:182267
Number of nodes clustered (usually terminal taxa):181992
Number of subtrees clustered (always internal nodes):57631
Number of nodes with sequences that can be clustered:236023
Clusters:
Total number of clusters:1599650
Number of phylogenetically informative clusters (TIs >= 4):87084
Number of singleton clusters (GIs = 1):1189605
Number of large clusters (GIs >= 100):10338
Number of large clusters (TIs >= 100):3042
Size of largest cluster (w.r.t. GIs):7479
Size of largest cluster (w.r.t. TIs):4070
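The cluster categories listed above are simple threshold counts. As a hedged illustration, the sketch below assumes that "GIs" is the number of sequences in a cluster and "TIs" is the number of distinct taxon IDs, and that each cluster is given as a hypothetical `(n_gis, n_tis)` pair. The function and key names are not part of the database.

```python
def summarize_clusters(clusters):
    """Tally clusters by the size categories used in the listing above.

    clusters: list of (n_gis, n_tis) pairs, one pair per cluster.
    """
    gis = [g for g, _ in clusters]
    tis = [t for _, t in clusters]
    return {
        "total": len(clusters),
        "informative": sum(t >= 4 for t in tis),    # TIs >= 4
        "singleton": sum(g == 1 for g in gis),      # GIs = 1
        "large_by_gi": sum(g >= 100 for g in gis),  # GIs >= 100
        "large_by_ti": sum(t >= 100 for t in tis),  # TIs >= 100
        "largest_by_gi": max(gis, default=0),
        "largest_by_ti": max(tis, default=0),
    }
```

In the figures reported above, about 74% of all clusters are singletons (1189605 of 1599650) and about 5% are phylogenetically informative (87084 of 1599650).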
Questions or comments? Contact Mike Sanderson (sanderm at email dot arizona dot edu)