| All life on Earth is related in a
complex genealogy—much of which can be described by a phylogenetic
“tree of life.” The advent of rapid and inexpensive molecular
sequencing has generated vast databases of easily accessible
comparative sequence data, which has been widely used to construct
relatively small subtrees of this larger tree. Whereas extensive
research has focused on the problem of building a tree from a single
data set, relatively little is known about extracting these data sets
en masse from sequence databases and then assembling a synthesis. These
two areas naturally give rise to novel computational problems that are as
challenging as the basic tree-building problem itself. The scale of the
data input is large: GenBank, for example, now archives data on some 24
million sequences from over 120,000 species. We propose to study new
algorithmic and computational problems that arise in reconstructing the
large set of phylogenetic trees of sequences encompassed by databases
of such size. Moreover, we will examine what this complex collection of
partially overlapping trees of sequences implies about the phylogenetic
tree of species.

Our research necessarily includes theoretical and empirical efforts. Theoretical work will be aimed at solving specific computational problems necessary to extract and assemble data sets and construct comprehensive phylogenetic trees. These problems concern three broad issues: assessing the potential information content of sequence databases; optimal extraction of data from databases to construct species trees from sequences; and integration of species trees into "supertrees", which are larger trees assembled from smaller ones that share some but not all species. A key problem in the last area is targeting future sequencing efforts by identifying minimal sets of new sequences needed to construct optimal supertrees.

Empirical work will focus on analysis of three databases that pose a range of computational challenges and represent a fair sample of databases containing useful phylogenetic information (three taxonomically enriched subsets of GenBank, all of SWISS-PROT, and the TIGR EGO database of expressed sequence tags from model eukaryotic organisms). This work will characterize the phylogenetic information content of these sequence sets, identify maximal sets of combinable sequence information, construct nonredundant partitions of the database to permit estimation of gene trees, assemble species trees from gene trees, and assemble supertrees from species trees. Quality of the final supertrees will be tested by a cross-validation procedure.

To implement these projects we have assembled an interdisciplinary team of phylogenetic biologists and computer scientists with experience in phylogenetic theory, data analysis, and algorithm development and implementation. We have also established collaborations with three ongoing taxon-oriented Tree-of-Life projects to provide tests of our sequence-targeting algorithms.

The work is intended to have several broader impacts. Data sets, software and tools will be made available to the scientific community.
Because the work crosses disciplines, it will integrate advances in algorithm theory with the latest analyses of biological data arising from genome research and biodiversity studies. It may help foster a shift in the way phylogenetics is currently practiced, away from a focus on construction of individual data sets and trees and toward a more synthetic focus on collections of trees. As part of an explicitly interdisciplinary effort, students and postdoctoral fellows will receive unusually broad training through internships at other team members’ labs. This will promote development of expertise that will be important in continuing efforts along these lines in years to come. Finally, by focusing on a readily accessible subset of the information on biological diversity, the entire project dovetails with the much larger community-wide effort to build the Tree of Life using all of that information. |