The Genome Sequence of Drosophila melanogaster

American Association for the Advancement of Science (AAAS) - Vol. 287, No. 5461 - pp. 2185-2195 - 2000
Mark D. Adams1, S Celniker2, Robert A. Holt1, Cheryl Evans1, Jeannine D. Gocayne1, Peter G. Amanatides1, Steven E. Scherer3, Peter W. Li1, Ian Holmes2, Richard F. Galle2, Reed George2, Tim Hubbard4, Stephen M. Richards2, Michael Ashburner5, Scott N. Henderson1, Granger G. Sutton1, Jennifer R. Wortman1, Mark Yandell1, Qing Zhang1, Lin X. Chen1, Rhonda Brandon1, Yu-Hui Rogers1, Robert G. Blazej2, Mark Champe2, Barret D. Pfeiffer2, Kenneth H. Wan2, Clare Doyle2, Ellen Baxter2, Gregg Helt6, Catherine R. Nelson4, George L. Gabor, M Lengyel7, Josep F. Abril8, Anna Agbayani2, Hui-Jin An1, Cynthia Andrews‐Pfannkoch1, Danita Baldwin1, Richard M. Ballew1, Anand Basu1, James Baxendale1, Leyla Bayraktaroglu9, Ellen M. Beasley1, Karen Beeson1, Panayiotis V. Benos10, Benjamin P. Berman2, Deepali Bhandari1, Slava Bolshakov11, Dana Borkova12, Michael R. Botchan13, John Bouck3, Peter Brokstein4, Phillipe Brottier14, Kenneth C. Burtis15, Dana Busam1, H. Butler16, Édouard Cadieu17, Angela Center1, Ishwar Chandra1, J. Michael Cherry18, Simon Cawley19, Carl Dahlke1, Lionel B. Davenport1, Peter L. Davies1, Beatriz de Pablos20, Arthur L. Delcher1, Zuoming Deng1, Anne Deslattes Mays1, Ian Dew1, Susanne Dietz1, Kristina Dodson1, Lisa Doup1, Michael Downes21, Shannon Dugan-Rocha3, Boris C. Dunkov22, Patrick Dunn1, K. James Durbin3, Carlos Evangelista1, Concepción Ferraz23, Steven Ferriera1, Wolfgang Fleischmann5, Carl Fosler1, Andrei Gabrielian1, Neha Garg1, William M Gelbart9, Ken Glasser1, Anna Glodek1, Fangcheng Gong1, James H. Gorrell3, Zhiping Gu1, Ping Guan1, Michael A. Harris1,24, Nomi L. Harris2, Damon A. Harvey4, Thomas J. Heiman1, Judith Hernandez3, Jarrett Houck1, Damon Hostin1, K Houston2, Timothy J. Howland1, Minghui Wei1, Chinyere Ibegwam1, Mena Jalali1, Francis Kalush1, Gary H. Karpen21, Zhaoxi Ke1, James A. Kennison25, Karen A. Ketchum1, Bruce E. Kimmel2, Chinnappa D. 
Kodira1, Ðắc-Trung Nguyễn1, Saul Kravitz1, David Kulp6, Zhongwu Lai1, Paul Lasko26, Yiding Lei1, Alexander A. Levitsky1, Jiayin Li1, Zhenya Li1,27, Yong Liang1, Xiaoying Lin28, Xiangjun Liu1, Bettina Mattei1, Tina C. McIntosh1, Michael P. McLeod3, D McPherson1, Gennady V. Merkulov1, Natalia V. Milshina1, Clark Mobarry1, J. Glenn Morris6, Ali Moshrefi2, Stephen M. Mount29, Mee Moy1, Brian J. Murphy1, Lee Murphy30, Donna M. Muzny3, David L. Nelson3, David R. Nelson31, Keith A. Nelson1, Katherine Nixon2, Deborah Nusskern1, Joanne Pacleb2, Michael Palazzolo2, Gary S. Pittman1, Sue Pan1, John R. Pollard1, Vinita Puri1, Martin G. Reese4, Knut Reinert1, Karin Remington1, Robert D. C. Saunders32,33, Frederick Scheeler1, Hua Shen3, Byron Shue1, Inga Sidén‐Kiamos11, Michael A. Simpson1, Marian Skupski1, Tom Smith1, Eugene G. Spier1, Allan C. Spradling34, Mark Stapleton2, Renee Strong1, Eric I. Sun1, Robert Svirskas35, Cyndee Tector1, R. Turra1, Vera Lúcia da Silva Valente1, Aihui H. Wang1, Xin Wang1, Zhenyuan Wang1, David A. Wassarman36, George M. Weinstock3, Jean Weissenbach14, S. Williams1, Trevor Woodage1, Kim C. Worley3, David Wu1, Song Yang2, Qingping Yao1, Jane J. Ye1,37, Ru‐Fang Yeh19, Jayshree Zaveri1, Ming Zhan1, Guangren Zhang1, Qi Zhao1, Liansheng Zheng1, Xiangqun Zheng-Bradley1, Fei Zhong1, Wenyan Zhong1, Xiaojun Zhou3, Shiaoping C. Zhu1, Zhu Xiao-hong1, Hamilton O. Smith1, Richard A. Gibbs3, Eugene W. Myers1, Gerald M. Rubin38, J. Craig Venter1
1Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA
2Berkeley Drosophila Genome Project (BDGP), Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
3Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
4BDGP, Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA.
5European Molecular Biology Laboratory (EMBL) European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
6Neomorphic Inc., 2612 Eighth Street, Berkeley, CA 94710, USA.
7GenetixXpress Pty. Ltd., 78 Pacific Road, Palm Beach, Sydney, NSW 2108, Australia.
8Department of Medical Informatics, IMIM–UPF C/Dr. Aiguader 80, 08003 Barcelona, Spain.
9Department of Molecular and Cellular Biology, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138 USA
10Department of Genetics, Box 8232, Washington University Medical School, 4566 Scott Avenue, St. Louis, MO 63110, USA.
11Institute of Molecular Biology and Biotechnology, FORTH, Heraklion, Greece
12European Drosophila Genome Project (EDGP), EMBL, Heidelberg, Germany.
13Department of Molecular and Cell Biology, University of California, Berkeley, CA 94710, USA.
14Genoscope, 2 rue Gaston Crémieux, 91000 Evry, France.
15Section of Molecular and Cellular Biology, University of California, Davis, CA 95618, USA.
16Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK
17EDGP, Rennes University Medical School, UPR 41 CNRS Recombinaisons Genetiques, Faculte de Medicine, 2 av. du Pr. Leon Bernard, 35043 Rennes Cedex, France.
18Department of Genetics, Stanford University, Palo Alto, CA 94305, USA.
19Department of Statistics, University of California, Berkeley, CA 94720, USA
20EDGP, Centro de Biologı́a Molecular Severo Ochoa, CSIC, Universidad Autónoma de Madrid, 28049 Madrid, Spain.
21MBVL, Salk Institute, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA.
22Department of Biochemistry and Center for Insect Science, University of Arizona, Tucson, AZ 85721, USA
23EDGP, Montpellier University Medical School, Institut de Genetique Humaine, CNRS (CRBM), 114 rue de la Cardonille, 34396 Montpellier Cedex 5, France.
24Lawrence Berkeley National Laboratory, Berkeley, United States
25Laboratory of Molecular Genetics, National Institute of Child Health and Human Development, National Institutes of Health (NIH), Bethesda, MD 20892, USA.
26Department of Biology, McGill University, 1205 Avenue Docteur Penfield, Montreal, Quebec, Canada.
27J. Craig Venter Institute, La Jolla, United States
28The Institute for Genomic Research, Rockville, MD 20850, USA
29Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742 USA
30EDGP, Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
31Department of Biochemistry, University of Tennessee, Memphis, TN 38163 USA
32EDGP, Department of Anatomy and Physiology, University of Dundee, Dundee DD1 4HN, UK, and Department of Biological Sciences, Open University, Milton Keynes MK7 6AA, UK.
33University of Dundee, Dundee, United Kingdom
34HHMI/Embryology, Carnegie Institution of Washington, Baltimore, MD 21210, USA.
35Motorola (United States), Schaumburg, United States
36Cell Biology and Metabolism Branch, National Institute of Child Health and Human Development, NIH, Bethesda, MD 20892, USA.
37University of California, Berkeley, Berkeley, United States
38Howard Hughes Medical Institute, BDGP, University of California, Berkeley, CA 94720, USA.

Abstract

The fly Drosophila melanogaster is one of the most intensively studied organisms in biology and serves as a model system for the investigation of many developmental and cellular processes common to higher eukaryotes, including humans. We have determined the nucleotide sequence of nearly all of the ∼120-megabase euchromatic portion of the Drosophila genome using a whole-genome shotgun sequencing strategy supported by extensive clone-based sequence and a high-quality bacterial artificial chromosome physical map. Efforts are under way to close the remaining gaps; however, the sequence is of sufficient accuracy and contiguity to be declared substantially complete and to support an initial analysis of genome structure and preliminary gene annotation and interpretation. The genome encodes ∼13,600 genes, somewhat fewer than the smaller Caenorhabditis elegans genome, but with comparable functional diversity.


References

Miklos G. L. G., Rubin G. M., Cell 86, 521 (1996).

Spradling A. S., et al., Genetics 153, 135 (1999).

10.1093/genetics/153.1.179

10.1126/science.280.5369.1540

10.1126/science.287.5461.2216

Hartl D. L., et al., Trends Genet. 8, 70 (1992).

10.1126/science.7542800

Fraser C. M., Fleischmann R. D., Electrophoresis 18, 1207 (1997).

Weber J. L., Myers E. W., Genome Res. 7, 409 (1997).

10.1038/381364a0

10.1126/science.287.5461.2271

10.1126/science.287.5461.2196

A number of methods were used to close gaps. Whenever possible, gaps were localized to a chromosome region and a spanning genomic clone was identified. When a spanning clone could be identified, it was used as a template for sequencing. The sequencing approach was determined by the gap size. For gaps smaller than 1 kb, BAC templates were sequenced directly with custom primers. For gaps larger than 1 kb, either 3-kb plasmids or M13 clones from the clone-based draft sequencing were sequenced by directed methods, or 10-kb plasmids from the WGS sequencing project were sequenced by random transposon-based methods. If no 3-kb or 10-kb plasmid could be identified, PCR products were amplified from BAC clones or genomic DNA and end-sequenced directly with the PCR primers.
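The decision rules in this note can be sketched as a small function. This is only an illustration under stated assumptions: the `Gap` class, its field names, and the returned strings are hypothetical, and the note does not say how a gap of exactly 1 kb, a small gap without a spanning BAC, or a gap with both plasmid types available was handled, so those branches are guesses.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    """Hypothetical description of one sequence gap (names are illustrative)."""
    size_bp: int
    has_spanning_bac: bool     # a spanning genomic clone was identified
    has_3kb_or_m13: bool       # clone from the clone-based draft sequencing
    has_10kb_plasmid: bool     # plasmid from the WGS sequencing project


BAC_DIRECT = "sequence BAC template directly with custom primers"
DIRECTED = "directed sequencing of 3-kb plasmid or M13 clone"
TRANSPOSON = "random transposon-based sequencing of 10-kb plasmid"
PCR = "amplify PCR product from BAC or genomic DNA and end-sequence it"


def choose_gap_strategy(gap: Gap) -> str:
    """Pick a closure method following the rules stated in the note."""
    if gap.size_bp < 1000:
        # Small gaps: sequence the spanning BAC directly when one exists.
        if gap.has_spanning_bac:
            return BAC_DIRECT
        return PCR  # assumption: the note does not cover this case
    # Gaps of 1 kb or more (treating exactly 1 kb as "larger" is an assumption).
    if gap.has_3kb_or_m13:
        return DIRECTED  # assumption: preferred over 10-kb plasmids if both exist
    if gap.has_10kb_plasmid:
        return TRANSPOSON
    return PCR
```

For example, `choose_gap_strategy(Gap(2500, True, False, True))` selects the transposon-based route, because no 3-kb plasmid or M13 clone is available for that gap.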

Weiler K. S., Wakimoto B. T., Annu. Rev. Genet. 29, 577 (1995);

Henikoff S., Biochim. Biophys. Acta 1470, 1 (2000);

Pimpinelli S., et al., Proc. Natl. Acad. Sci. U.S.A. 92, 3804 (1995);

Lohe A. R., Hilliker A. J., Roberts P. A., Genetics 134, 1149 (1993).

Miklos G. L. G., Yamamoto M., Davies J., Pirrotta V., Proc. Natl. Acad. Sci. U.S.A. 85, 2051 (1988).

See ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/nuclear_cds_set.embl.v2.9.Z.

The genes found in unscaffolded sequence were Su(Ste) (FlyBase identifier FBgn0003582) on the Y chromosome, His1 (FBgn0001195) and His4 (FBgn0001200) (histone genes were screened out before assembly), rbp13 (FBgn0014016), and idr (FBgn0020850).

10.1006/jmbi.1997.0951

M. G. Reese, D. Kulp, H. Tammana, D. Haussler, Genome Res., in press.

Sequence contigs were searched against publicly available sequence at the DNA level and as six-frame translations against public protein sequence data. DNA searches were against the invertebrate (INV) division of GenBank, a set of 80,000 EST sequences produced at BDGP and assembled to produce consensus sequences (21), and a set of curated Drosophila protein-coding genes prepared by three of the authors (M. Ashburner, L. Bayraktaroglu, and P. V. Benos) (15). Protein searches were performed against this set of curated protein sequences and against the nonredundant protein database available at the National Center for Biotechnology Information. Initial searches were performed with a version of BLAST2 (25) optimized for the Compaq Alpha architecture. Additional processing of each query-subject pair was performed to improve the alignments. All BLAST results having an expectation score of <1 × 10^-4 were then processed on the basis of their high-scoring pair (HSP) coordinates on the contig to remove redundant hits, retaining hits that supported possible alternative splicing. This procedure was performed separately for hits to particular organisms, so as not to exclude HSPs that support the same gene structure. Sequences producing BLAST hits judged to be informative, nonredundant, and sufficiently similar to the contig sequence were then realigned to the contig with Sim4 [Florea L., Hartzell G., Zhang Z., Rubin G. M., Miller W., Genome Res. 8, 967 (1998)] for ESTs and with Lap [Huang X., Adams M. D., Zhou H., Kerlavage A. R., Genomics 46, 37 (1995)] for proteins. Because both of these algorithms take splicing into account, the resulting alignments usually respect intron-exon boundaries and thus facilitate human annotation. Some regions of the genome may be underannotated because the bulk of the annotation work was done on an earlier assembly version. Continued updates will be available through FlyBase.
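The hit-filtering step described in this note can be illustrated with a short sketch. The note gives the expectation cutoff and says redundancy was judged from HSP coordinates separately per organism, but it does not give the redundancy rule itself. The rule used here (drop a hit whose interval lies entirely inside a higher-scoring hit from the same organism) is an assumption, as are the `Hit` record and its field names. Each hit is reduced to a single interval, and the retention of hits supporting alternative splicing is not modeled.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical summary of one BLAST hit on a contig."""
    subject: str
    organism: str
    evalue: float
    score: float
    start: int  # first contig position covered by the hit
    end: int    # last contig position covered by the hit


E_CUTOFF = 1e-4  # expectation threshold stated in the note


def filter_hits(hits):
    """Keep significant hits, removing redundant ones within each organism."""
    by_organism = defaultdict(list)
    for hit in hits:
        if hit.evalue < E_CUTOFF:
            by_organism[hit.organism].append(hit)

    kept = []
    for organism in sorted(by_organism):
        accepted = []
        # Visit the best-scoring hits first so weaker duplicates are dropped.
        for hit in sorted(by_organism[organism], key=lambda h: -h.score):
            # Assumed redundancy rule: fully contained in an accepted hit.
            contained = any(a.start <= hit.start and hit.end <= a.end
                            for a in accepted)
            if not contained:
                accepted.append(hit)
        kept.extend(accepted)
    return kept
```

Grouping by organism before testing containment means a worm protein and a fly protein that cover the same contig interval are both retained, which matches the stated aim of not excluding HSPs that support the same gene structure.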

M. G. Reese, G. Hartzell, N. L. Harris, U. Ohler, S. E. Lewis, Genome Res., in press.

10.1126/science.287.5461.2222

See the Gene Ontology Web site (www.geneontology.org).

See the Saccharomyces Genome Database Web site ().

D. Allen and J. Blake, Mouse Genome Informatics (www.informatics.jax.org).

10.1093/nar/25.17.3389

Mount S. M., et al., Nucleic Acids Res. 20, 4255 (1992).

The C. elegans Sequencing Consortium, Science 282, 2012 (1998).

10.1038/45471

10.1126/science.287.5461.2204

Dutta A., Bell S. P., Annu. Rev. Cell Dev. Biol. 13, 293 (1997).

Chesnokov I., Gossen M., Remus D., Botchan M., Genes Dev. 13, 1288 (1999).

Feger G., Gene 227, 149 (1999).

Pak D. T., et al., Cell 97, 311 (1997);

Rohrbough J., Pinto S., Mihalek R. M., Tully T., Broadie K., Neuron 23, 55 (1999).

Waga S., Hannon G. J., Beach D., Stillman B., Nature 369, 574 (1994);

Flores-Rozas H., et al., Proc. Natl. Acad. Sci. U.S.A. 91, 8655 (1994).

10.1016/S0959-437X(98)80149-4

Hirano T., Curr. Opin. Genet. Dev. 10, 317 (1998);

Strunnikov A. V., Trends Cell Biol. 8, 454 (1998).

10.1093/hmg/9.2.175

Craig J. M., Earnshaw W. C., Vagnarelli P., Exp. Cell Res. 246, 249 (1999);

Saffery R., et al., Chromosome Res. 7, 261 (1996).

Belotserkovskaya R., Berger S. L., Crit. Rev. Eukaryotic Gene Expr. 9, 221 (1999).

10.1093/nar/23.14.2715

Pollard K. J., Peterson C. L., Bioessays 20, 771 (1998).

Koonin E. V., Zhou S., Lucchesi J. C., Nucleic Acids Res. 23, 4229 (1995).

Jeanmougin F., et al., Trends Biochem. Sci. 22, 151 (1997);

10.1038/10640

Levis R. W., Mol. Gen. Genet. 236, 440 (1993);

Biessmann H., Mason J. M., Chromosoma 106, 63 (1997).

Gallinari P., Jiricny J., Nature 383, 735 (1996).

Flores B., Engels W., Proc. Natl. Acad. Sci. U.S.A. 96, 2964 (1999).

Kusano K., Berres M. E., Engels W. R., Genetics 151, 1027 (1999);

Sekelsky J. J., Brodsky M. H., Rubin G. M., Hawley R. S., Nucleic Acids Res. 27, 3762 (1999).

Hampsey M., Microbiol. Mol. Biol. Rev. 62, 465 (1998);

Reeder R. H., Prog. Nucleic Acid Res. Mol. Biol. 62, 293 (1999);

Willis I. M., Eur. J. Biochem. 212, 1 (1993).

Lee T. I., Young R. A., Genes Dev. 12, 1398 (1998);

Hampsey M., Reinberg D., Curr. Opin. Genet. Dev. 9, 132 (1999).

10.1073/pnas.96.9.4791

D. Duboule, Ed., Guidebook to the Homeobox Genes (Oxford Univ. Press, New York, 1994).

Wool I. G., Trends Biochem. Sci. 21, 164 (1996).

Lambertsson A., Adv. Genet. 38, 69 (1998).

Jankowska-Anyszka M., et al., J. Biol. Chem. 273, 10538 (1998).

Culbertson M. R., Trends Genet. 15, 74 (1999).

C. Burge, T. Tuschl, P. Sharp, in The RNA World, R. Gesteland, T. Cech, J. Atkins, Eds. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, ed. 2, 1999).

Will C. L., Schneider C., Reed R., Luhrmann R., Science 284, 2003 (1999).

Feyereisen R., Annu. Rev. Entomol. 44, 507 (1999).

See D. Nelson's Web site ().

von Heijne G., J. Mol. Biol. 225, 487 (1992).

Hartenstein K., et al., Genetics 147, 1755 (1997).

Tearle R. G., Belote J. M., McKeown M., Baker B. S., Howells A. J., Genetics 122, 595 (1989).

Maleszka R., Microbiology 143, 1781 (1997).

Wang Q., Hasan G., Pikielny C. W., J. Biol. Chem. 274, 10309 (1999).

Dunkov B. C., Georgieva T., DNA Cell Biol. 18, 937 (1999).

Yoshiga T., et al., Eur. J. Biochem. 260, 414 (1999).

Kennard M. L., et al., EMBO J. 14, 4178 (1995).

High molecular weight genomic DNA was prepared from nuclei isolated [Shaffer C. D., Wuller J. M., Elgin S. C. R., Methods Cell Biol. 44, 185 (1994)] from 2.59 g of embryos of an isogenic y; cn bw sp strain [10.1093/genetics/137.3.803]. The genomic DNA was randomly sheared, end-polished with Bal31 nuclease/T4 DNA polymerase, and carefully size-selected on 1% low-melting-point agarose. After ligation to BstX1 adaptors, genomic fragments were inserted into BstX1-linearized plasmid vector. Libraries of 1.8 ± 0.2 kb were cloned in a high-copy pUC18 derivative, and libraries of 9.8 ± 1.0, 10.5 ± 1.0, and 11.5 ± 1.0 kbp were cloned in a medium-copy pBR322 derivative. High-throughput methods in 384-well format were implemented for plasmid growth, alkaline lysis plasmid purification, and ABI Big Dye Terminator DNA sequencing reactions. Sequence reads from the genomic libraries were generated over a 4-month period using 300 DNA analyzers (ABI Prism 3700). These reads represent more than 12× coverage of the 120-Mbp euchromatic portion of the Drosophila genome (Table 1). Base-calling was performed using 3700 Data Collection (PE Biosystems), and sequence data were transferred to a Unix computer environment for further processing. Error probabilities were assigned to each base with TraceTuner software developed at Paracel, Inc. (www.paracel.com). The predicted error probability was used to trim each sequence read such that the overall accuracy of each trimmed read was predicted to be >98.5% and no single 50-bp region was less than 97% accurate. The efficacy of TraceTuner and the trimming algorithm was demonstrated by comparing trimmed sequence reads to high-quality finished sequence data from BDGP (Fig. 2).
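The two trimming criteria in this note (predicted overall accuracy above 98.5% and no 50-bp region below 97% accuracy) can be illustrated with a minimal sketch. It is not the trimming algorithm actually used, which the note does not describe. The function name `trim_read`, its parameters, the choice to return the longest qualifying interval, and the treatment of reads shorter than 50 bp are all assumptions.

```python
def trim_read(error_probs, max_mean_error=0.015, window=50,
              max_window_error=0.03):
    """Return (start, end), the longest half-open interval of the read whose
    mean predicted error is below max_mean_error and which contains no full
    window with mean error above max_window_error. Returns (0, 0) if none."""
    n = len(error_probs)
    prefix = [0.0]
    for p in error_probs:
        prefix.append(prefix[-1] + p)

    def mean_error(s, e):
        return (prefix[e] - prefix[s]) / (e - s)

    # bad[i] is True when the window covering bases i .. i+window-1 fails.
    bad = [mean_error(i, i + window) > max_window_error
           for i in range(n - window + 1)]

    # next_bad[s] = smallest failing window start at or after s, else None.
    next_bad = [None] * (n + 1)
    nearest = None
    for i in range(n - 1, -1, -1):
        if i < len(bad) and bad[i]:
            nearest = i
        next_bad[i] = nearest

    best = (0, 0)
    for s in range(n):
        # The interval must end before it fully contains a failing window.
        e_max = n if next_bad[s] is None else next_bad[s] + window - 1
        for e in range(e_max, s, -1):
            if e - s <= best[1] - best[0]:
                break  # cannot improve on the current best from this start
            if mean_error(s, e) < max_mean_error:
                best = (s, e)
                break
    return best
```

The search is quadratic in read length in the worst case, which is acceptable for reads of a few hundred bases. A production trimmer would more likely scan inward from both ends.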

For clone-based genomic sequencing, BAC, P1, and cosmid DNAs were prepared by alkaline lysis procedures and purified by CsCl gradient ultracentrifugation. DNA was randomly sheared and size-selected on LMP agarose for fragments in the 3-kb range for plasmids and in the 2-kb range for M13 clones. After blunt-ending with T4 DNA polymerase, plasmids were generated by ligation to BstX1 adaptors and insertion into BstX1-linearized pOT2A vector. M13 clones were generated using the double-adaptor protocol [Andersson B., et al., Anal. Biochem. 236, 107 (1996)]. Plasmid sequencing templates were prepared by alkaline lysis (Qiagen) or by PCR, and M13 templates were prepared using the sodium perchlorate–glass fiber filter technique [Andersson B., et al., Biotechniques 20, 1022 (1996)]. Paired end-sequences of 3-kb plasmid subclones were generated (principally) with ABI Big Dye Terminator chemistry on ABI 377 slab gel or ABI 3700 capillary sequencers. Additional M13 subclone sequence was generated using BODIPY dye-labeled primers. Procedures for finishing sequence to high quality at LBNL were as described (3).

Yamamoto M.-T., et al., Genetics 125, 821 (1990).

J. F. Abril and R. Guigo, Bioinformatics, in press.

A. Peter et al., in preparation.

J. Locke, L. Podemski, N. Aippersbach, H. Kemp, R. Hodgetts, in preparation.

The many participants from academic institutions are grateful for their various sources of support. We thank B. Thompson and his staff for the excellent laboratories and work environment, M. Peterson and his team for computational support, and V. Di Francesco, S. Levy, K. Chaturvedi, D. Rusch, C. Yan, and V. Bonazzi for technical discussions and thoughtful advice. We are indebted to R. Guigo and to E. Lerner of Aquent Partners for assistance with illustrations. The work described was funded by Celera Genomics, the Howard Hughes Medical Institute, and NIH grant P50-HG00750 (G.M.R.).