The Sequence of the Human Genome

American Association for the Advancement of Science (AAAS) - Volume 291, Issue 5507 - Pages 1304-1351 - 2001
J. Craig Venter1, Mark D. Adams1, Eugene W. Myers1, Peter W. Li1, Richard Mural1, Granger G. Sutton1, Hamilton O. Smith1, Mark Yandell1, Cheryl Evans1, Robert A. Holt1, Jeannine D. Gocayne1, Peter G. Amanatides1, Richard M. Ballew1, Daniel H. Huson1, Jennifer R. Wortman1, Qing Zhang1, Chinnappa D. Kodira1, Xiangqun Zheng-Bradley1, Lin Chen1, Marian Skupski1, G. Subramanian1, Paul D Thomas1, Jinghui Zhang1, George L. Gabor Miklos2, Catherine R. Nelson3, Samuel Broder1, Andrew G. Clark4, J H Nadeau5, Victor A. McKusick6, Norton D. Zinder7, Arnold J. Levine7, Richard J. Roberts8, Melvin l. Simon9, Carolyn W. Slayman10, Michael W. Hunkapiller11, Randall Bolanos1, Arthur L. Delcher1, Ian Dew1, Daniel Fasulo1, Michael J. Flanigan1, Liliana Florea1, Daniel L. Halligan1, Sridhar Hannenhalli1, Saul Kravitz1, Samuel Lévy1, Clark Mobarry1, Knut Reinert1, Karin Remington1, Jane Abu-Threideh1, Ellen M. Beasley1, Kendra Biddick1, Vivien Bonazzi1, Rhonda Brandon1, Michele Cargill1, Ishwar Chandramouliswaran1, Rosane Charlab1, Kabir Chaturvedi1, Zuoming Deng1, Valentina Di Francesco1, Patrick Dunn1, Karen Eilbeck1, Carlos Evangelista1, Andrei Gabrielian1, Weiniu Gan1, Wangmao Ge1, Fangcheng Gong1, Zhiping Gu1, Ping Guan1, Thomas J. Heiman1, Maureen E. Higgins1, Rui‐Ru Ji1, Zhaoxi Ke1, Karen A. Ketchum1, Zhongwu Lai1, Yiding Lei1, Zhenya Li1, Jiayin Li1, Yong Liang1, Xiaoying Lin1, Fu Lu1, Gennady V. Merkulov1, Natalia V. Milshina1, Helen M. Moore1, Ashwinikumar K. Naik1, Vaibhav A. Narayan1, Beena Neelam1, Deborah Nusskern1, Douglas B. Rusch1, Steven L. Salzberg12, Wei Shao1, Bixiong Chris Shue1, Jing‐Tao Sun1, Zhen Yuan Wang1, Aihui Wang1, Xin Wang1, Jun Wang1, Minghui Wei1, Ron Wides13, Chunlin Xiao1, Chunhua Yan1, Alison Yao1, Jane J. Ye1, Ming Zhan1, Weiqing Zhang1, Hongyu Zhang1, Qi Zhao1, Liansheng Zheng1, Fei Zhong1, Wenyan Zhong1, Shiaoping C. Zhu1, Claire M. Fraser12, Dennis A. 
Gilbert1, Suzanna Baumhueter1, Gene Spier1, Christine Carter1, Anibal Cravchik1, Trevor Woodage1, Feroze Ali1, Hui-Jin An1, Aderonke Awe1, Danita Baldwin1, Holly Baden1, Mary Barnstead1, Ian Barrow1, Karen Beeson1, Dana Busam1, Amy Carver1, Angela Center1, Ming Lai Cheng1, Liz Curry1, Steve Danaher1, Lionel B. Davenport1, Raymond Desilets1, Susanne Dietz1, Kristina Dodson1, Lisa Doup1, Steven Ferriera1, Neha Garg1, Andres Gluecksmann1, Brit J. Hart1, Jason Haynes1, Charles A. Haynes1, Cheryl Heiner1, Suzanne L. Hladun1, Damon Hostin1, Jarrett Houck1, Timothy J. Howland1, Chinyere Ibegwam1, Jeffery E. Johnson1, Francis Kalush1, Lesley Kline1, Shashi Koduru1, Amy Love1, F.H. Mann1, David May1, Steven McCawley1, Tina C. McIntosh1, Ivy McMullen1, Mee Moy1, Linda Moy1, Brian J. Murphy1, K. E. Nelson1, Cynthia Pfannkoch1, Eric C. Pratts1, Vinita Puri1, Hina Qureshi1, Matthew S. Reardon1, Robert Rodriguez1, Yu-Hui Rogers1, Deanna L. Romblad1, Bob Ruhfel1, Rodney J. Scott1, Cynthia D. Sitter1, Michelle Smallwood1, Erin Stewart1, Renee Strong1, Ellen Suh1, Russell S. Thomas1, Ni Ni Tint1, Sukyee Tse1, Claire Vech1, Gary Wang1, Jeremy Wetter1, S. Williams1, Monica S. Williams1, Sandra M. Windsor1, Emily S. Winn-Deen1, Keriellen Wolfe1, Jayshree Zaveri1, K. Zaveri1, Josep F. Abril14, Roderic Guigó14, Michael J. Campbell1, Kimmen Sjölander1, Brian Karlak1, Anish Kejariwal1, Betty V. Lazareva1, Thomas W. Hatton1, Apurva Narechania1, Karen Diemer1, Anushya Muruganujan1, Nan Guo1, Shinji Sato1, Vineet Bafna1, Sorin Istrail1, Ross A. Lippert1, Russell Schwartz1, Brian Walenz1, Shibu Yooseph1, David R. Allen1, Anand Basu1, James Baxendale1, Louis Blick1, Marcelo Caminha1, John Carnes-Stine1, Parris M. Caulk1, Yen-Hui Chiang1, My D. Coyne1, Carl Dahlke1, Anne Deslattes Mays1, Maria Dombroski1, Michael Donnelly1, Dale Ely1, Shiva Esparham1, Carl Fosler1, Harold C. Gire1, Stephen Glanowski1, Kenneth Glasser1, Anna Glodek1, Mark Gorokhov1, Ken Graham1, Barry Gropman1, Michael A. 
Harris1, Jeremy Heil1, Scott N. Henderson1, Jeffrey P. Hoover1, Donald E. Jennings1, Catherine Jordan1, James M. Jordan1, John Kasha1, Leonid Kagan1, Ðắc-Trung Nguyễn1, Alexander A. Levitsky1, Mark G. Lewis1, Xiangjun Liu1, John Lopez1, J. Daniel1, William H. Majoros1, Joe W. McDaniel1, Sean D. Murphy1, Matthew Newman1, Ngoc B. Nguyen1, Marc Nodell1, Sue Pan1, Jim Peck1, Marshall Peterson1, William Rowe1, Robert D. Sanders1, John Scott1, Michael A. Simpson1, Thomas J. Smith1, Arlan C. Sprague1, Timothy B. Stockwell1, R. Turra1, Vera Lúcia da Silva Valente1, Mei Wang1, Mei Wen1, David Wu1, Mitchell M. Wu1, Ashley C. Xia1, Ali Zandieh1, Zhu Xiao-hong1
1Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA
2GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia.
3Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA.
4Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA.
5Department of Genetics, Case Western Reserve University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA.
6Johns Hopkins University School of Medicine, Johns Hopkins Hospital, 600 North Wolfe Street, Blalock 1007, Baltimore, MD 21287–4922, USA.
7Rockefeller University, 1230 York Avenue, New York, NY 10021–6399, USA.
8New England Biolabs (United States), Ipswich, United States
9Division of Biology, 147-75, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA.
10Yale University School of Medicine, 333 Cedar Street, P.O. Box 208000, New Haven, CT 06520–8000, USA.
11Applied Biosystems, 850 Lincoln Centre Drive, Foster City CA 94404, USA
12The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
13Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 52900, Israel
14Grup de Recerca en Informàtica Mèdica, Institut Municipal d'Investigació Mèdica, Universitat Pompeu Fabra, 08003-Barcelona, Catalonia, Spain.

Abstract

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies—a whole-genome assembly and a regional chromosome assembly—were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ∼12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. 
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
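
The coverage figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below uses only numbers stated above; treating the combined coverage as a plain sum of the Celera and shredded public coverage is a simplifying assumption, since the public data covered only the regions that had been sequenced.

```python
# Figures quoted in the abstract.
total_bases = 14.8e9        # bp of sequence generated by Celera
num_reads = 27_271_853      # high-quality sequence reads
celera_coverage = 5.11      # fold coverage from Celera reads
public_coverage = 2.9       # fold coverage from shredded public data

# Derived quantities (illustrative; not stated in the abstract itself).
mean_read_length = total_bases / num_reads
implied_genome_size = total_bases / celera_coverage
combined_coverage = celera_coverage + public_coverage  # assumes simple addition

print(f"mean read length: {mean_read_length:.0f} bp")
print(f"implied genome size: {implied_genome_size / 1e9:.2f} Gbp")
print(f"combined coverage: {combined_coverage:.2f}-fold")
```

The sum of about 8.0 agrees with the "eightfold" effective coverage reported in the abstract.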

References

10.1016/0888-7543(89)90142-0

U.S. Department of Energy, Office of Health and Environmental Research, Sequencing the Human Genome: Summary Report of the Santa Fe Workshop, Santa Fe, NM, 3 to 4 March 1986 (Los Alamos National Laboratory, Los Alamos, NM, 1986).

R. Cook-Deegan, The Gene Wars: Science, Politics, and the Human Genome (Norton, New York, 1996).

10.1038/265687a0

Seeburg P. H., et al., Trans. Assoc. Am. Physicians 90, 109 (1977).

10.1016/0003-2697(86)90536-1

10.1073/pnas.84.23.8296

10.3109/10425179209034023

10.1038/ng0892-348

10.3109/10425179109020778

10.1126/science.2047873

10.1038/355632a0

10.1038/ng0793-256

10.1038/ng0893-373

10.1038/ng0893-381

10.1038/5976

Adams M. D., et al., Nature 377, 3 (1995).

10.1093/nar/21.16.3829

10.1016/0022-2836(82)90546-0

Mahy B. W. J., Esposito J. J., Venter J. C., Am. Soc. Microbiol. News 57, 577 (1991).

10.1126/science.7542800

10.1126/science.270.5235.397

10.1126/science.273.5278.1058

10.1038/41483

10.1038/37052

10.1038/381364a0

10.1006/geno.1996.0154

10.1006/geno.1999.6082

10.1038/45471

10.1101/gr.7.5.401

10.1101/gr.7.5.410

10.1126/science.280.5367.1185

10.1126/science.280.5369.1540

10.1038/368474a0

10.1126/science.280.5366.994

10.1126/science.287.5461.2185

10.1126/science.287.5461.2204

10.1126/science.287.5461.2196

10.1126/science.282.5389.682

International Human Genome Sequencing Consortium, Nature 409, 860 (2001).

Institutional review board: P. Calabresi (chairman), H. P. Freeman, C. McCarthy, A. L. Caplan, G. D. Rogell, J. Karp, M. K. Evans, B. Margus, C. L. Carter, R. A. Millman, S. Broder.

Eligibility criteria for participation in the study were as follows: prospective donors had to be 21 years of age or older, not pregnant, and capable of giving an informed consent. Donors were asked to self-define their ethnic backgrounds. Standard blood bank screens (screening for HIV, hepatitis viruses, and so forth) were performed on all samples at the clinical laboratory prior to DNA extraction in the Celera laboratory. All samples that tested positive for transmissible viruses were ineligible and were discarded. Karyotype analysis was performed on peripheral blood lymphocytes from all samples selected for sequencing; all were normal. A two-staged consent process for prospective donors was employed. The first stage of the consent process provided information about the genome project, procedures, and risks and benefits of participating. The second stage of the consent process involved answering follow-up questions and signing consent forms, and was conducted about 48 hours after the first.

DNA was isolated from blood (173) or sperm. For sperm, a washed pellet (100 μl) was lysed in a suspension (1 ml) containing 0.1 M NaCl, 10 mM tris-Cl–20 mM EDTA (pH 8), 1% SDS, 1 mg proteinase K, and 10 mM dithiothreitol for 1 hour at 37°C. The lysate was extracted with aqueous phenol and with phenol/chloroform. The DNA was ethanol precipitated and dissolved in 1 ml TE buffer. To make genomic libraries, DNA was randomly sheared, end-polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments, and size-selected by electrophoresis on 1% low-melting-point agarose. After ligation to Bst XI adapters (Invitrogen, catalog no. N408-18), DNA was purified by three rounds of gel electrophoresis to remove excess adapters, and the fragments, now with 3′-CACA overhangs, were inserted into Bst XI-linearized plasmid vector with 3′-TGTG overhangs. Libraries with three different average sizes of inserts were constructed: 2, 10, and 50 kbp. The 2-kbp fragments were cloned in a high-copy pUC18 derivative. The 10- and 50-kbp fragments were cloned in a medium-copy pBR322 derivative. The 2- and 10-kbp libraries yielded uniform-sized large colonies on plating. However, the 50-kbp libraries produced many small colonies and inserts were unstable. To remedy this, the 50-kbp libraries were digested with Bgl II, which does not cleave the vector but generally cleaved several times within the 50-kbp insert. A 1264-bp Bam HI kanamycin resistance cassette (purified from pUCK4; Amersham Pharmacia, catalog no. 27-4958-01) was added, and ligation was carried out at 37°C in the continual presence of Bgl II. As Bgl II–Bgl II ligations occurred, they were continually cleaved, whereas Bam HI–Bgl II ligations were not cleaved. A high yield of internally deleted circular library molecules was obtained in which the residual insert ends were separated by the kanamycin cassette DNA.
The internally deleted libraries, when plated on agar containing ampicillin (50 μg/ml), carbenicillin (50 μg/ml), and kanamycin (15 μg/ml), produced relatively uniform large colonies. The resulting clones could be prepared for sequencing using the same procedures as clones from the 10-kbp libraries.

Transformed cells were plated on agar diffusion plates prepared with a fresh top layer containing no antibiotic poured on top of a previously set bottom layer containing excess antibiotic, to achieve the correct final concentration. This method of plating permitted the cells to develop antibiotic resistance before being exposed to antibiotic, without the potential clone bias that can be introduced through liquid outgrowth protocols. After colonies had grown, QBot (Genetix, UK) automated colony-picking robots were used to pick colonies meeting stringent size and shape criteria and to inoculate 384-well microtiter plates containing liquid growth medium. Liquid cultures were incubated overnight, with shaking, and were scored for growth before passing to template preparation. Template DNA was extracted from liquid bacterial culture using a procedure based upon the alkaline lysis miniprep method (173), adapted for high-throughput processing in 384-well microtiter plates. Bacterial cells were lysed; cell debris was removed by centrifugation; and plasmid DNA was recovered by isopropanol precipitation and resuspended in 10 mM tris-HCl buffer. Reagent dispensing operations were accomplished using Titertek MAP 8 liquid dispensing systems. Plate-to-plate liquid transfers were performed using Tomtec Quadra 384 Model 320 pipetting robots. All plates were tracked throughout processing by unique plate barcodes. Mated sequencing reads from opposite ends of each clone insert were obtained by preparing two 384-well cycle sequencing reaction plates from each plate of plasmid template DNA, using ABI-PRISM BigDye Terminator chemistry (Applied Biosystems) and standard M13 forward and reverse primers. Sequencing reactions were prepared using the Tomtec Quadra 384-320 pipetting robot. Parent-child plate relationships and, by extension, forward-reverse sequence mate pairs were established by automated plate barcode reading by the onboard barcode reader and were recorded by direct LIMS communication.
Sequencing reaction products were purified by alcohol precipitation and were dried, sealed, and stored at 4°C in the dark until needed for sequencing, at which time the reaction products were resuspended in deionized formamide and sealed immediately to prevent degradation. All sequence data were generated using a single sequencing platform, the ABI PRISM 3700 DNA Analyzer. Sample sheets were created at load time using a Java-based application that facilitates barcode scanning of the sequencing plate barcode, retrieves sample information from the central LIMS, and reserves unique trace identifiers. The application permitted a single sample sheet file in the linking directory and deleted previously created sample sheet files immediately upon scanning of a sample plate barcode, thus enhancing sample sheet-to-plate associations.

10.1073/pnas.74.12.5463

10.1126/science.2443975

Celera's computing environment is based on Compaq Computer Corporation's Alpha system technology running the Tru64 Unix operating system. Celera uses these Alphas as Data Servers and as nodes in a Virtual Compute Farm (VCF), all of which are connected to a fully switched network operating at Fast Ethernet speed (for the VCF) and gigabit Ethernet speed (for data servers). Load balancing and scheduling software manages the submission and execution of jobs based on central processing unit (CPU) speed, memory requirements, and priority. The VCF is composed of 440 Alpha CPUs, which include model EV6 running at a clock speed of 400 MHz and EV67 running at 667 MHz. Available memory on these systems ranges from 2 GB to 8 GB. The VCF is used to manage trace file processing and annotation. Genome assembly was performed on a GS 160 running 16 EV67s (667 MHz) and 64 GB of memory, and on 10 ES40s running 4 EV6s (500 MHz) and 32 GB of memory. A total of 100 terabytes of physical disk storage was included in a Storage Area Network that was available to systems across the environment. To ensure high availability, file and database servers were configured as 4-node Alpha TruClusters, so that services would fail over in the event of hardware or software failure. Data availability was further enhanced by using hardware- and software-based disk mirroring (RAID-1), disk striping (RAID-0), and disk striping with parity (RAID-5).

Trace processing generates quality values for base calls by means of Paracel's TraceTuner, trims sequence reads according to quality values, trims vector and adapter sequence from high-quality reads, and screens sequences for contaminants. Similar in design and algorithm to the phred program (174), TraceTuner reports quality values that reflect the log-odds score of each base being correct. Read quality was evaluated in 50-bp windows, each read being trimmed to include only those consecutive 50-bp segments with a minimum mean accuracy of 97%. End windows (both ends of the trace) of 1, 5, 10, 25, and 50 bases were trimmed to a minimum mean accuracy of 98%. Every read was further checked for vector and contaminant matches of 50 bp or more and, if found, the read was removed from consideration. Finally, any match to the 5′ vector splice junction in the initial part of a read was removed.
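
The 50-bp window rule described in this note can be sketched roughly as follows. This is not TraceTuner's implementation: it assumes phred-scale quality values (Q = −10 log10 of the error probability), non-overlapping windows, and that the longest run of passing windows is kept. The function names are hypothetical, and the end-window and vector-screening steps are omitted.

```python
def error_probability(q):
    """Phred-scale quality value -> probability that the base call is wrong."""
    return 10 ** (-q / 10)

def window_accuracies(quals, window=50):
    """Mean accuracy of each full, non-overlapping window (an assumption)."""
    accs = []
    for start in range(0, len(quals) - window + 1, window):
        chunk = quals[start:start + window]
        accs.append(sum(1 - error_probability(q) for q in chunk) / window)
    return accs

def trim_read(quals, window=50, min_accuracy=0.97):
    """Return (start, end) of the longest run of passing windows, or None."""
    best = None
    run_start = None
    accs = window_accuracies(quals, window)
    for i, acc in enumerate(accs + [0.0]):  # sentinel closes a trailing run
        if acc >= min_accuracy:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            span = (run_start * window, i * window)
            if best is None or span[1] - span[0] > best[1] - best[0]:
                best = span
            run_start = None
    return best

# A read with a poor start (Q10), a good middle (Q30), and a poor tail (Q5).
quals = [10] * 50 + [30] * 100 + [5] * 50
print(trim_read(quals))
```

On this toy read only the two middle windows reach 97% mean accuracy, so the retained interval is bases 50 to 150.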

National Center for Biotechnology Information (NCBI); available at www.ncbi.nlm.nih.gov/.

NCBI; available at www.ncbi.nlm.nih.gov/HTGS/.

All bactigs over 3 kbp were examined for coverage by Celera mate pairs. An interval of a bactig was deemed an assembly error where there were no mate pairs spanning the interval and at least two reads that should have their mate on the other side of the interval but did not. In other words, there was no mate pair evidence supporting a join in the breakpoint interval and at least two mate pairs contradicting the join. By this criterion, we detected and broke apart bactigs at 13,037 locations or, equivalently, we found 2.13% of the bactigs to be misassembled.
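
The breakpoint criterion in this note can be expressed as a small predicate. The sketch below is illustrative only: the data layout (coordinate pairs for placed mate pairs, and read position plus expected mate position for reads whose mate is missing) and the function names are assumptions, not the authors' code.

```python
def spans(a, b, interval):
    """True if positions a and b lie on opposite sides of the interval."""
    lo, hi = sorted((a, b))
    start, end = interval
    return lo <= start and hi >= end

def is_assembly_error(interval, placed_pairs, orphan_reads, min_contradicting=2):
    """placed_pairs: (pos, mate_pos) for mate pairs found in the bactig.
    orphan_reads: (pos, expected_mate_pos) for reads whose mate is absent
    from where it should be (hypothetical representation)."""
    supporting = sum(spans(a, b, interval) for a, b in placed_pairs)
    contradicting = sum(spans(p, e, interval) for p, e in orphan_reads)
    return supporting == 0 and contradicting >= min_contradicting

interval = (1000, 1200)
placed = [(100, 900), (1300, 3000)]       # neither pair crosses the interval
orphans = [(500, 2500), (800, 2800)]      # two mates expected across it
print(is_assembly_error(interval, placed, orphans))
```

A single placed pair that crosses the interval is enough to count as supporting evidence and suppress the break.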

We considered a BAC entry to be chimeric if, by the Lander-Waterman statistic (175), the odds were 0.99 or more that the assembly we produced was inconsistent with the sequence coming from a single source. By this criterion, 714, or 2.2%, of BAC entries were deemed chimeric.

10.1089/cmb.1996.3.563

E. W. Myers, J. L. Weber, in Computational Methods in Genome Research, S. Suhai, Ed. (Plenum, New York, 1996), pp. 73–89.

P. Deloukas et al., Science 282, 744 (1998).

M. A. Marra et al., Genome Res. 7, 1072 (1997).

J. Zhang et al. data not shown.

Shredded bactigs were located on long CSA scaffolds (>500 kbp), and the distribution of these fragments on the scaffolds was analyzed. If the spread of these fragments was greater than four times the reported BAC length, the BAC was considered to be chimeric. In addition, if >20% of the bactigs of a given BAC were found on different scaffolds that were not adjacent in map position, then the BAC was also considered chimeric. The total number of chimeric BACs divided by the number of BACs used for CSA gave the minimal estimate of the chimerism rate.
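
The two chimerism tests in this note can be sketched as below. The representation is an assumption made for illustration: each bactig is reduced to a (scaffold, position) placement, the "home" scaffold is taken to be the one holding the most bactigs, and map adjacency is supplied as a set of scaffold identifiers.

```python
from collections import Counter

def is_chimeric(bac_length, placements, adjacent_scaffolds=frozenset(),
                spread_factor=4, off_scaffold_fraction=0.20):
    """placements: list of (scaffold_id, position) for the bactigs of one BAC."""
    # Assumption: the scaffold holding the most bactigs is the BAC's home.
    primary, _ = Counter(s for s, _ in placements).most_common(1)[0]
    positions = [p for s, p in placements if s == primary]
    # Test 1: fragments spread over more than four times the BAC length.
    if max(positions) - min(positions) > spread_factor * bac_length:
        return True
    # Test 2: more than 20% of bactigs on non-adjacent scaffolds.
    stray = sum(1 for s, _ in placements
                if s != primary and s not in adjacent_scaffolds)
    return stray / len(placements) > off_scaffold_fraction

print(is_chimeric(150_000, [("s1", 0), ("s1", 100_000), ("s1", 140_000)]))
print(is_chimeric(150_000, [("s1", 0), ("s1", 700_000)]))
```

The first BAC is compact and passes both tests; the second is flagged because its fragments span 700 kbp, more than four times its 150-kbp length.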

10.1038/35012518

10.1038/990031

10.1073/pnas.230438397

The International RH Mapping Consortium available at www.ncbi.nlm.nih.gov/genemap99/.

See Masker.html.

10.1016/S0167-7799(98)01232-3

10.1016/S0022-2836(05)80360-2

10.1126/science.1057437

See .

10.1126/science.6189184

10.1093/nar/11.16.5497

10.1038/43722

10.1038/76115

10.1038/76118

M. Yandell in preparation.

10.1016/S0168-9525(99)01882-X

Scaffolds containing greater than 10 kbp of sequence were analyzed for features of biological importance through a series of computational steps, and the results were stored in a relational database. For scaffolds greater than one megabase, the sequence was cut into single megabase pieces before computational analysis. All sequence was masked for complex repeats using RepeatMasker (52) before gene finding or homology-based analysis. The computational pipeline required ∼7 hours of CPU time per megabase, including repeat masking, or a total compute time of about 20,000 CPU hours. Protein searches were performed against the nonredundant protein database available at the NCBI. Nucleotide searches were performed against human, mouse, and rat Celera Gene Indices (assemblies of cDNA and EST sequences), mouse genomic DNA reads generated at Celera (3×), the Ensembl gene database available at the European Bioinformatics Institute (EBI), human and rodent (mouse and rat) EST data sets parsed from the dbEST database (NCBI), and a curated subset of the RefSeq experimental mRNA database (NCBI). Initial searches were performed on repeat-masked sequence with BLAST 2.0 (54) optimized for the Compaq Alpha compute-server and an effective database size of $3 \times 10^{9}$ for BLASTN searches and $1 \times 10^{9}$ for BLASTX searches. Additional processing of each query-subject pair was performed to improve the alignments. All protein BLAST results having an expectation score of $<1 \times 10^{-4}$, human nucleotide BLAST results having an expectation score of $<1 \times 10^{-8}$ with >94% identity, and rodent nucleotide BLAST results having an expectation score of $<1 \times 10^{-8}$ with >80% identity were then examined on the basis of their high-scoring pair (HSP) coordinates on the scaffold to remove redundant hits, retaining hits that supported possible alternative splicing. For BLASTX searches, analysis was performed separately for selected model organisms (yeast, mouse, human, C. elegans, and D. melanogaster) so as not to exclude HSPs from these organisms that support the same gene structure. Sequences producing BLAST hits judged to be informative, nonredundant, and sufficiently similar to the scaffold sequence were then realigned to the genomic sequence with Sim4 for ESTs and with Lap for proteins. Because both of these algorithms take splicing into account, the resulting alignments usually give a better representation of intron-exon boundaries than standard BLAST analyses and thus facilitate further annotation (both machine and human). In addition to the homology-based analysis described above, three ab initio gene prediction programs were used (63).
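
The expectation-score and identity cutoffs in this note amount to a simple per-hit filter, sketched below. The `Hit` record and its field names are hypothetical, and the rodent expectation cutoff is read here as 10^-8 (an assumption; the extracted text is ambiguous on the sign of the exponent). The redundancy removal based on HSP coordinates is not shown.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    kind: str        # "protein", "human_nt", or "rodent_nt" (hypothetical labels)
    evalue: float    # BLAST expectation score
    identity: float  # percent identity; not used for protein hits

# (maximum expectation score, minimum percent identity or None)
THRESHOLDS = {
    "protein": (1e-4, None),
    "human_nt": (1e-8, 94.0),
    "rodent_nt": (1e-8, 80.0),   # exponent assumed to be -8
}

def passes(hit):
    """True if the hit meets the cutoffs for its search type."""
    max_evalue, min_identity = THRESHOLDS[hit.kind]
    if hit.evalue >= max_evalue:
        return False
    return min_identity is None or hit.identity > min_identity

hits = [
    Hit("protein", 1e-6, 35.0),
    Hit("human_nt", 1e-20, 93.0),
    Hit("rodent_nt", 1e-12, 85.0),
]
print([passes(h) for h in hits])
```

In this toy input the human nucleotide hit is rejected on identity even though its expectation score is far below the cutoff.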

10.1016/S0076-6879(96)66018-2

10.1006/jmbi.1997.0951

10.1016/S0076-6879(99)03007-4

10.1101/gr.10.4.516

L. Florea et al., Genome Res. 8, 967 (1998).

Miklos G. L., John B., Am. J. Hum. Genet. 31, 264 (1979).

10.1159/000133633

P. E. Warburton, H. F. Willard, in Human Genome Evolution, M. S. Jackson, T. Strachan, G. Dover, Eds. (BIOS Scientific, Oxford, 1996), pp. 121–145.

10.1101/gr.10.6.839

10.1016/0168-9525(89)90055-3

Holmquist G. P., Am. J. Hum. Genet. 51, 17 (1992).

10.1016/S0378-1119(99)00485-0

10.1016/0378-1119(96)00393-9

10.1016/0168-9525(85)90070-8

10.1086/302011

10.1146/annurev.genet.34.1.331

10.1016/0168-9525(87)90294-0

10.1016/0022-2836(87)90689-9

10.1016/0888-7543(92)90024-M

10.1016/0959-437X(95)80044-1

J. Peters, Genome Biol. 1, reviews1028.1 (2000).

10.1093/hmg/9.18.2651

10.1073/pnas.90.24.11995

10.1007/s003350010071

10.1016/S0378-1119(00)00089-5

10.1093/nar/23.1.98

10.1093/hmg/9.14.2117

10.1074/jbc.274.35.24849

10.1006/geno.1999.5874

10.1007/BF01435251

10.1101/gr.10.5.672

Lek first compares all proteins in the proteome to one another. Next, the resulting BLAST reports are parsed, and a graph is created wherein each protein constitutes a node; any hit between two proteins with an expectation beneath a user-specified threshold constitutes an edge. Lek then uses this graph to compute a similarity between each protein pair ij in the context of the graph as a whole by simply dividing the number of BLAST hits shared in common between the two proteins by the total number of proteins hit by i and j. This simple metric has several interesting properties. First, because the similarity metric takes into account both the similarity and the differences between the two sequences at the level of BLAST hits, the metric respects the multidomain nature of protein space. Two multidomain proteins, for instance, each containing domains A and B, will have a greater pairwise similarity to each other than either one will have to a protein containing only A or B domains, so long as A-B–containing multidomain proteins are less frequent in the proteome than are single-domain proteins containing A or B domains. A second interesting property of this similarity metric is that it can be used to produce a similarity matrix for the proteome as a whole without having to first produce a multiple alignment for each protein family, an error-prone and very time-consuming process. Finally, the metric does not require that either sequence have significant homology to the other in order to have a defined similarity to each other, only that they share at least one significant BLAST hit in common. This is an especially interesting property of the metric, because it allows the rapid recovery of protein families from the proteome for which no multiple alignment is possible, thus providing a computational basis for the extension of protein homology searches beyond those of current HMM- and profile-based search methods.
Once the whole-proteome similarity matrix has been calculated, Lek first partitions the proteome into single-linkage clusters (27) on the basis of one or more shared BLAST hits between two sequences. Next, these single-linkage clusters are further partitioned into subclusters, each member of which shares a user-specified pairwise similarity with the other members of the cluster, as described above. For the purposes of this publication, we have focused on the analysis of single-linkage clusters and what we have termed “complete clusters,” i.e., those subclusters for which every member has a similarity metric of 1 to every other member of the subcluster. We believe that the single-linkage and complete clusters are of special interest, in part because they allow us to estimate and to compare sizes of core protein sets in a rigorous manner. The rationale for this is as follows: if one imagines for a moment a perfect clustering algorithm capable of perfectly partitioning one or more perfectly annotated protein sets into protein families, it is reasonable to assume that the number of clusters will always be greater than or equal to the number of single-linkage clusters, because single-linkage clustering is a maximally agglomerative clustering method. Thus, if there exists a single protein in the predicted protein set containing domains A and B, then it will be clustered by single linkage together with all single-domain proteins containing domains A or B. Likewise, for a predicted protein set containing a single multidomain protein, the number of real clusters must always be less than or equal to the number of complete clusters, because it is impossible to place a unique multidomain protein into a complete cluster. Thus, the single-linkage and complete clusters plus singletons should comprise a lower and upper bound of sizes of core protein sets, respectively, allowing us to compare the relative size and complexity of different organisms' predicted protein sets.
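
The similarity metric and the single-linkage step described in this note can be illustrated with a toy example. This is a minimal sketch, not the Lek implementation: it assumes each protein's hit set includes the protein itself, reads "total number of proteins hit by i and j" as the union of the two hit sets (so the metric is a Jaccard index), and uses made-up protein names.

```python
def similarity(hits, i, j):
    """Shared BLAST hits divided by all proteins hit by i or j."""
    union = hits[i] | hits[j]
    return len(hits[i] & hits[j]) / len(union) if union else 0.0

def single_linkage_clusters(hits):
    """Group proteins that share at least one BLAST hit, transitively."""
    parent = {p: p for p in hits}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    # Invert the hit lists: every protein hitting the same target is linked.
    hit_by = {}
    for query, targets in hits.items():
        for target in targets:
            hit_by.setdefault(target, []).append(query)
    for queries in hit_by.values():
        for q in queries[1:]:
            parent[find(q)] = find(queries[0])

    clusters = {}
    for p in hits:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(clusters.values(), key=lambda c: sorted(c))

# Toy proteome: AB1 and AB2 carry domains A and B; A1 and B1 carry one each.
hits = {
    "AB1": {"AB1", "AB2", "A1", "B1"},
    "AB2": {"AB1", "AB2", "A1", "B1"},
    "A1": {"A1", "AB1", "AB2"},
    "B1": {"B1", "AB1", "AB2"},
    "C1": {"C1"},
}
print(similarity(hits, "AB1", "AB2"))   # two-domain pair
print(similarity(hits, "AB1", "A1"))    # two-domain vs single-domain
print(single_linkage_clusters(hits))
```

In this toy set the two A-B proteins score 1.0 against each other but only 0.75 against A1, while single linkage still merges all four into one cluster, which matches the lower-bound argument made in the note.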

10.1016/0022-2836(81)90087-5

10.1093/nar/27.11.2369

Arabidopsis Genome Initiative, Nature 408, 796 (2000).

The probability that a contiguous set of proteins is the result of a segmental duplication can be estimated approximately as follows. Given that proteins A and B occur on one chromosome, and that A′ and B′ (paralogs of A and B) also exist in the genome, the probability that B′ occurs immediately after A′ is $1/N$, where $N$ is the number of proteins in the set (for this analysis, $N = 26{,}588$). Allowing for B′ to occur as any of the next $J - 1$ proteins (leaving a gap between A′ and B′) increases the probability to $(J - 1)/N$; allowing B′A′ or A′B′ gives a probability of $2(J - 1)/N$. Considering three genes ABC, the probability of observing A′B′C′ elsewhere in the genome, given that the paralogs exist, is $1/N^{2}$. Three proteins can occur across a spread of five positions in six ways; more generally, we compute the number of ways that $K$ proteins can be spread across $J$ positions by counting all possible arrangements of $K - 2$ proteins in the $J - 2$ positions between the first and last protein. Allowing for a spread to vary from $K$ positions (no gaps) to $J$ gives $L = \sum_{X=K-2}^{J-2} \binom{X}{K-2}$ arrangements. Thus, the probability of chance occurrence is $L/N^{K-1}$. Allowing for both sets of genes (e.g., ABC and A′B′C′) to be spread across $J$ positions increases this to $L^{2}/N^{K-1}$. The duplicated segment might be rearranged by the operations of reversal or translocation; allowing for $M$ such rearrangements gives us a probability $P = L^{2}M/N^{K-1}$. For example, the probability of observing a duplicated set of three genes in two different locations, where the three genes occur across a spread of five positions in both locations, is $36/N^{2}$; the expected number of such matched sets in the predicted protein set is approximately $N \cdot 36/N^{2} = 36/N$, a value $\ll 1$. Therefore, any such duplications of three genes are unlikely to result from random rearrangements of the genome. If any of the genes occur in more than two copies, the probability that the apparent duplication has occurred by chance increases.
The algorithm for selecting candidate duplications only generates matched protein sets with $P \ll 1$.
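
The worked example in this note (three genes across a spread of five positions) can be reproduced numerically. The sketch below reads the arrangement count as a sum of binomial coefficients over X from K − 2 to J − 2, which gives the six arrangements quoted in the text; the function names are illustrative only.

```python
from math import comb

def arrangements(k, j):
    """L: ways to spread k proteins over at most j positions."""
    return sum(comb(x, k - 2) for x in range(k - 2, j - 1))

def chance_probability(k, j, n, m=1):
    """P = L^2 * M / N^(K-1) for a duplicated set of k genes."""
    return arrangements(k, j) ** 2 * m / n ** (k - 1)

N = 26_588  # proteins in the set, as stated in the note
L = arrangements(3, 5)
P = chance_probability(3, 5, N)
print(L)        # number of arrangements for K = 3, J = 5
print(N * P)    # expected number of such matched sets, 36/N
```

With K = 3 and J = 5 this gives L = 6, so L squared is the 36 that appears in the note, and the expected count 36/N is roughly 0.0014.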

10.1093/hmg/7.1.13

10.1006/geno.1999.5900

10.1101/gr.144700

10.1002/(SICI)1097-0061(200004)17:1<22::AID-YEA5>3.0.CO;2-S

10.1038/46555

Reviewed in

10.1016/S0959-437X(98)80039-7

10.1101/gr.8.7.748

10.1101/gr.9.5.499

10.1038/35035083

10.1038/70570

W.-H. Li, Molecular Evolution (Sinauer, Sunderland, MA, 1997).

10.1038/10290

10.1038/10297

10.1101/gr.7.6.649

M. Nei, Molecular Evolutionary Genetics (Columbia Univ. Press, New York, 1987).

From the observed coverage of the sequences at each site for each individual, we calculated the probability that a SNP would be detected at the site if it were present. For each level of coverage, there is a binomial sampling of the two homologs for each individual, and a heterozygous site could only be ascertained if both homologs are present or if two alleles from different individuals are present. With coverage $x$ from a given individual, both homologs are present in the assembly with probability $1 - (1/2)^{x-1}$. Even if both homologs are present, the probability that a SNP is detected is <1, because a fraction of sites failed the quality criteria. Integrating over coverage levels, the binomial sampling, and the quality distribution, we derived an expected number of sites in the genome that were ascertained for polymorphism for each individual. The nucleotide diversity was then the observed number of variable sites divided by the expected number of sites ascertained.
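
The sampling probability in this note follows from each read being drawn from either homolog with probability 1/2: all x reads come from the same homolog with probability 2 × (1/2)^x, so both are seen with probability 1 − (1/2)^(x−1). The sketch below checks the closed form against brute-force enumeration; the function names and the example counts are illustrative and are not taken from the paper.

```python
from itertools import product

def prob_both_homologs(x):
    """Closed form: P(both homologs sampled) with x reads from one individual."""
    if x < 1:
        return 0.0
    return 1 - 0.5 ** (x - 1)

def prob_both_homologs_enumerated(x):
    """Same probability by listing every assignment of x reads to homologs A/B."""
    outcomes = list(product("AB", repeat=x))
    both = sum(1 for o in outcomes if len(set(o)) == 2)
    return both / len(outcomes)

def nucleotide_diversity(variable_sites, ascertained_sites):
    """Observed variable sites divided by expected ascertained sites."""
    return variable_sites / ascertained_sites

for x in range(1, 6):
    print(x, prob_both_homologs(x), prob_both_homologs_enumerated(x))

# Hypothetical counts chosen to give 1 variable site per 1250 bp.
print(nucleotide_diversity(800, 1_000_000))
```

A single read can never reveal heterozygosity, which is why the probability is zero at x = 1 and rises toward 1 only as coverage from that individual increases.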

10.1093/genetics/150.3.1133

D. A. Nickerson et al. Nature Genet. 19 233 (1998);

10.1101/gr.146900

10.1086/302825

10.1126/science.280.5366.1077

10.1016/S0168-9525(00)02030-8

10.1016/0040-5809(84)90027-3

R. R. Hudson in Oxford Surveys in Evolutionary Biology D. J. Futuyma J. D. Antonovics Eds. (Oxford Univ. Press Oxford 1990) vol. 7 pp. 1–44.

10.1086/301977

M. Kimura The Neutral Theory of Molecular Evolution (Cambridge Univ. Press Cambridge 1983).

10.1038/8785

10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L

10.1093/nar/28.1.263

Brief description of the methods used to build the Panther classification. First, the June 2000 release of the GenBank NR protein database (excluding sequences annotated as fragments or mutants) was partitioned into clusters using BLASTP. For the clustering, a seed sequence was randomly chosen, and the cluster was defined as all sequences matching the seed to statistical significance (E-value < $10^{-5}$) and “globally” alignable (the length of the match region must be >70% and <130% of the length of the seed). If the cluster had more than five members and at least one from a multicellular eukaryote, the cluster was extended. For the extension step, a hidden Markov model (HMM) was trained for the cluster using the SAM software package, version 2. The HMM was then scored against GenBank NR (excluding mutants but including fragments for this step), and all sequences scoring better than a specific (NLL-NULL) score were added to the cluster. The HMM was then retrained (with fixed model length), and all sequences in the cluster were aligned to the HMM to produce a multiple sequence alignment. This alignment was assessed by a number of quality measures. If the alignment failed the quality check, the initial cluster was rebuilt around the seed using a more restrictive E-value, followed by extension, alignment, and reassessment. This process was repeated until the alignment passed the quality check. The multiple alignment and “general” (i.e., describing the entire cluster, or “family”) HMM (176) were then used as input into the BETE program (177). BETE calculates a phylogenetic tree for the sequences in the alignment. Functional information about the sequences in each cluster was parsed from SwissProt (178) and GenBank records. “Tree-attribute viewer” software was used by biologist curators to correlate the phylogenetic tree with protein function. Subfamilies were manually defined on the basis of shared function across subtrees and were named accordingly.
HMMs were then built for each subfamily, using information from both the subfamily and family (K. Sjölander, in preparation). Families were also manually named according to the functions contained within them. Finally, all of the families and subfamilies were classified into categories and subcategories based on their molecular functions. The categorization was done by manual review of the family and subfamily names, by examining SwissProt and GenBank records, and by review of the literature as well as resources on the World Wide Web. The current version (2.0) of the Panther molecular function schema has four levels: category, subcategory, family, and subfamily. Protein sequences for whole eukaryotic genomes (for the predicted human proteins and annotated proteins for fly, worm, yeast, and Arabidopsis) were scored against the Panther library of family and subfamily HMMs. If the score was significant (the NLL-NULL score cutoff depends on the protein family), the protein was assigned to the family or subfamily function with the most significant score.
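The seed-clustering rule described at the start of this note (E-value below 10^-5 and a match region between 70% and 130% of the seed length) can be written as a simple filter. The sketch below is illustrative only: the `Hit` record and function names are hypothetical, the hits are made-up values rather than BLASTP output, and the HMM extension, retraining, and quality-check steps are not modeled.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical summary of one database match against the seed."""
    subject_id: str
    e_value: float
    match_length: int   # length of the matched region, in residues

def in_seed_cluster(hit: Hit, seed_length: int, max_e_value: float = 1e-5) -> bool:
    """Apply the two criteria from the note: significance and global alignability."""
    ratio = hit.match_length / seed_length
    return hit.e_value < max_e_value and 0.70 < ratio < 1.30

def build_initial_cluster(hits: list, seed_length: int) -> list:
    """Return the ids of all hits that satisfy both criteria."""
    return [h.subject_id for h in hits if in_seed_cluster(h, seed_length)]

# Made-up hits against a seed of 300 residues.
example_hits = [
    Hit("seqA", 1e-40, 290),   # significant and globally alignable
    Hit("seqB", 1e-3, 295),    # not significant
    Hit("seqC", 1e-20, 120),   # significant but only a local match
    Hit("seqD", 1e-8, 380),    # significant, ratio about 1.27
]
print(build_initial_cluster(example_hits, seed_length=300))
```

Rebuilding the cluster "using a more restrictive E-value", as the note describes for failed alignments, would correspond to calling the filter again with a smaller `max_e_value`.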

10.1093/nar/27.1.229

A. Goffeau et al. Science 274 546 563 (1996).

C. elegans Sequencing Consortium Science 282 2012 (1998).

S. A. Chervitz et al. Science 282 2022 (1998).

E. R. Kandel J. H. Schwartz T. Jessell Principles of Neural Science (McGraw-Hill New York ed. 4 2000).

10.1146/annurev.bi.65.070196.002355

10.1016/S0074-7696(00)96005-4

10.1002/1097-4695(200008)44:2<219::AID-NEU11>3.0.CO;2-W

10.1038/35039559

10.1007/978-1-4615-4685-6_22

10.1016/S0896-6273(00)00028-3

10.1146/annurev.neuro.21.1.75

10.1016/0166-2236(95)93898-8

10.1074/jbc.274.35.24453

B. Sampo et al. Proc. Natl. Acad. Sci. U.S.A. 97 3666 (2000).

10.1002/glia.440070402

M. Bernfield et al. Annu. Rev. Biochem. 68 729 (1999).

10.1038/35008000

10.1074/jbc.273.39.24979

J. L. Riechmann et al. Science 290 2105 (2000).

10.1074/jbc.274.36.25555

10.1016/S0955-0674(98)80042-2

10.1016/S0968-0004(98)01341-3

A. G. Uren et al. Mol. Cell 6 961 (2000).

10.1007/BF00357792

K. Meyer-Siegler et al. Proc. Natl. Acad. Sci. U.S.A. 88 8460 (1991).

10.1093/nar/21.4.993

10.1006/exnr.2000.7489

10.1101/gr.8.5.509

10.3109/08830189909088492

10.1093/nar/18.6.1513

10.1073/pnas.95.8.4463

10.1002/(SICI)1097-0177(199911)216:3<267::AID-DVDY5>3.0.CO;2-V

10.3109/03008200009005638

10.1038/5102

10.1126/science.1749935

10.1182/blood.V93.6.1798.406k22_1798_1808

10.1016/S1074-5521(00)00093-4

10.1101/gad.14.9.1027

10.1016/S1357-2725(98)00134-4

10.1126/science.281.5375.375

10.1126/science.287.5459.1809

10.1016/S0014-5793(00)01581-7

10.1515/znb-1967-1218

H. J. Muller in Heritage from Mendel R. A. Brink Ed. (Univ. of Wisconsin Press Madison WI 1967) p. 419.

J. F. Crow M. Kimura Introduction to Population Genetics Theory (Harper & Row New York 1970).

K. Kobayashi et al. Nature 394 388 (1998).

A. P. Feinberg Curr. Top. Microbiol. Immunol. 249 87 (2000).

10.1038/79598

10.1016/S0959-437X(99)00022-2

10.1126/science.290.5497.1765

10.1016/S0168-9525(00)02106-5

10.1038/35040593

10.1080/00087114.1971.10796455

10.1016/S0022-5193(87)80172-8

10.1093/genetics/141.4.1619

10.1038/10794

10.1073/pnas.95.7.3731

10.1002/neu.480240610

10.1103/PhysRevLett.63.105

10.1002/(SICI)1099-0526(199609/10)2:1<44::AID-CPLX10>3.0.CO;2-X

10.1126/science.286.5439.509

10.1016/0092-8674(94)90553-3

J. Sambrook E. F. Fritsch T. Maniatis Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press Cold Spring Harbor NY ed. 2 1989).

10.1101/gr.8.3.186

10.1101/gr.8.3.175

10.1016/0888-7543(88)90007-9

10.1006/jmbi.1994.1104

K. Sjölander Proc. Int. Soc. Mol. Biol. 6 165 (1998).

10.1093/nar/28.1.45

GO available at www.geneontology.org/.

10.1093/nar/28.1.33

We thank E. Eichler and J. L. Goldstein for many helpful discussions and critical reading of the manuscript, and A. Caplan for advice and encouragement. We also thank T. Hein, D. Lucas, G. Edwards, and the Celera IT staff for outstanding computational support. The cost of this project was underwritten by the Celera Genomics Group of the Applera Corporation. We thank the Board of Directors of Applera Corporation: J. F. Abely Jr. (retired), R. H. Ayers, J.-L. Bélingard, R. H. Hayes, A. J. Levine, T. E. Martin, C. W. Slayman, O. R. Smith, G. C. St. Laurent Jr., and J. R. Tobin for their vision, enthusiasm, and unwavering support, and T. L. White for leadership and advice. Data availability: The genome sequence and additional supporting information are available to academic scientists at the Web site (www.celera.com). Instructions for obtaining a DVD of the genome sequence can be obtained through the Web site. For commercial scientists wishing to verify the results presented here, the genome data are available upon signing a Material Transfer Agreement, which can also be found on the Web site.