The Complete Genome Sequence of Escherichia coli K-12

American Association for the Advancement of Science (AAAS) - Volume 277, Issue 5331 - Pages 1453-1462 - 1997
Frederick R. Blattner1, Guy Plunkett1, C A Bloch1, Nicole T. Perna1, Valerie Burland1, Monica Riley1, Julio Collado‐Vides1, Jeremy D. Glasner1, Christopher K. Rode1, George F. Mayhew1, J. W. Gregor1, N. Wayne Davis1, Heather Kirkpatrick1, Michael A. Goeden1, Debra J. Rose1, Bob Mau1, Ying Shao1
1F. R. Blattner, G. Plunkett III, N. T. Perna, J. D. Glasner, G. F. Mayhew, J. Gregor, N. W. Davis, H. A. Kirkpatrick, M. A. Goeden, D. J. Rose, B. Mau, and Y. Shao are at the Laboratory of Genetics, University of Wisconsin–Madison, 445 Henry Mall, Madison, WI 53706, USA. C. A. Bloch and C. K. Rode are in the Department of Pediatrics, University of Michigan School of Medicine, 1150 West Medical Center Drive, Ann Arbor, MI 48105, USA. V. Burland is at FMC Bioproducts, 191 Thomaston Street, Rockland,...

Abstract

The 4,639,221-base pair sequence of Escherichia coli K-12 is presented. Of the 4288 protein-coding genes annotated, 38% have no attributed function. Comparison with five other sequenced microbes reveals ubiquitous as well as narrowly distributed gene families; many families of similar genes within E. coli are also evident. The largest family of paralogous proteins contains 80 ABC transporters. The genome as a whole is strikingly organized with respect to the local direction of replication; guanines, oligonucleotides possibly related to replication and recombination, and most genes are so oriented. The genome also contains insertion sequence (IS) elements, phage remnants, and many other patches of unusual composition, indicating genome plasticity through horizontal transfer.
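The orientation of guanines relative to the direction of replication is commonly visualized as GC skew, (G - C)/(G + C), computed in windows along one strand. The sketch below is illustrative only and is not taken from the paper: the function name `gc_skew`, the window size, and the toy sequence are all assumptions.

```python
def gc_skew(seq, window):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows.

    Windows with no G or C get a skew of 0.0. A trailing partial
    window is ignored.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


# Toy sequence (hypothetical): a G-rich stretch followed by a C-rich stretch.
toy = "GGGGAGGTGG" + "CCCCACCTCC"
print(gc_skew(toy, window=10))
```

On a real genome, a sign change in the windowed skew is often used as a rough indicator of where the leading strand switches, but window size strongly affects how noisy the profile looks.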

References

10.1126/science.222.4625.719

Escherichia coli has been the subject of extensive monographs, the most recent of which is (2).

Escherichia coli and Salmonella: Cellular and Molecular Biology, F. C. Neidhardt et al., Eds. (ASM Press, Washington, DC, 1996).

The publicly available complete genome sequences are those of Haemophilus influenzae Rd [

10.1126/science.7542800

] Mycoplasma genitalium [

Fraser C. M., et al., ibid. 270, 397 (1995);

] Methanococcus jannaschii [C. J. Bult et al., ibid. 273, 1058 (1996)] Mycoplasma pneumoniae [

10.1093/nar/24.22.4420

] Synechocystis sp. strain PCC6803 [

10.1093/dnares/3.3.109

] and Saccharomyces cerevisiae [

10.1126/science.274.5287.546

].

Chuang S.-E., Daniels D. L., Blattner F. R., J. Bacteriol. 175, 2026 (1993);

D. J. Lockhart et al., Nature Biotechnol. 14, 1675 (1996).

Riley M., Labedan B., J. Mol. Biol. 269, 1 (1997).

F. C. Neidhardt, in (2), vol. 2, pp. 1–3.

B. Bachmann, in (2), vol. 2, pp. 2460–2488.

Jensen K. F., J. Bacteriol. 175, 3401 (1993).

Lawther R. P., et al., ibid. 149, 294 (1982).

Liu D., Reeves P. R., Microbiology 140, 49 (1994).

Yura T., et al., Nucleic Acids Res. 20, 3305 (1992);

Fujita N., Mori H., Yura T., Ishihama A., ibid. 22, 1637 (1994);

Oshima T., et al., DNA Res. 3, 137 (1996);

H. Aiba et al., ibid., p. 363; T. Itoh et al., ibid., p. 379.

Burland V., Daniels D. L., Plunkett G., Blattner F. R., Nucleic Acids Res. 21, 3385 (1993).

Six segments of the genome were sequenced using radioactive chemistry (14) [

Daniels D. L., Plunkett G., Burland V., Blattner F. R., Science 257, 771 (1992);

Plunkett G., Burland V., Daniels D. L., Blattner F. R., Nucleic Acids Res. 21, 3391 (1993);

F. R. Blattner, V. Burland, G. Plunkett III, H. J. Sofia, D. L. Daniels, ibid., p. 5408; H. J. Sofia, V. Burland, D. L. Daniels, G. Plunkett III, F. R. Blattner, ibid. 22, 2576 (1994); V. Burland, G. Plunkett III, H. J. Sofia, D. L. Daniels, F. R. Blattner, ibid. 23, 2105 (1995)]. We determined experimentally that deoxyinosine triphosphate (dITP) is the most effective analog for resolving G-C compressions, although it also causes premature termination. With radioactive sequencing, a dITP sequence lane must be run in addition to, rather than in place of, a deoxyguanosine triphosphate (dGTP) run. For efficiency in the areas of E. coli that we sequenced radioactively, tiling software was used to select a minimal set of M13 clones for resequencing with dITP after the bulk of the assembly had been completed with dGTP. On the other hand, because prematurely terminated chains are not labeled by the fluorophore, with dye-terminator fluorescent sequencing dITP can substitute totally for dGTP and can be used for all routine data collection.

Burland V., Plunkett G., Daniels D. L., Blattner F. R., Genomics 16, 551 (1993).

D. L. Daniels, in The Bacterial Chromosome, K. Drlica and M. Riley, Eds. (American Society for Microbiology, Washington, DC, 1990), pp. 43–51. It was often necessary to resequence overlapping regions between adjacent clones, and screening to remove lambda vector sequences before sequencing was costly. Occasionally we found deleted, mismapped, or chimeric lambda clones that created unexpected gaps in genome coverage.

Although the 1-μg yield of popout plasmid [

Pósfai G., et al., Nucleic Acids Res. 22, 2392 (1994);

] was low for early shotgun protocols, the assemblies were successful when supplemented with lambda clone and long-range PCR data. The main problem with extending this approach was the need to specifically engineer each popout plasmid by insertional recombination into the host.

I–Sce I is a site-specific intron-encoded homing endonuclease from yeast [

Perrin A., Buckle M., Dujon B., EMBO J. 12, 2939 (1993);

] whose 18-bp nonpalindromic recognition site is absent from E. coli (C. A. Bloch and C. K. Rode, unpublished data). Single I–Sce I sites were introduced into MG1655 on a transposable element to produce a mapped collection of strains, each with a unique I–Sce I site [

Rode C. K., Obreque V. H., Bloch C. A., Gene 166, 1 (1995);

Bloch C. A., Rode C. K., Obreque V. H., Mahillon J., Biochem. Biophys. Res. Commun. 223, 104 (1996);

]. P1 transduction was used to combine sites in pairs, permitting isolation of I–Sce I fragments as single bands by pulsed-field gel electrophoresis. Sequencing confirmed the expected nine-base overlap between adjacent fragments. Although the background contamination for entire I–Sce I fragment shotguns ranged from 15 to 30%, we occasionally observed individual preparative gels that seemed to have <5% background as assessed from gel images. We therefore suspect that improvements in gel handling and electrophoretic conditions could improve the overall quality of the fragment preparations.
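Checking that a recognition site is absent from a genome, as stated for I–Sce I in this note, amounts to scanning both strands for the site. The following is a minimal sketch and not the authors' procedure. The function names are hypothetical, and the site string is the commonly cited 18-bp I–Sce I sequence, which should be verified against the cited reference before use.

```python
# Assumption: commonly cited 18-bp I-SceI recognition sequence.
I_SCE_I_SITE = "TAGGGATAACAGGGTAAT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_site(genome, site):
    """Return sorted (1-based position, strand) hits for site on both strands.

    Positions refer to the leftmost base of the match on the given
    (forward) sequence. A palindromic site would be reported on both
    strands at the same position.
    """
    genome = genome.upper()
    site = site.upper()
    hits = []
    for strand, pattern in (("+", site), ("-", reverse_complement(site))):
        start = genome.find(pattern)
        while start != -1:
            hits.append((start + 1, strand))
            start = genome.find(pattern, start + 1)
    return sorted(hits)


# Toy sequence (hypothetical) carrying one site on each strand.
toy = "AAAA" + I_SCE_I_SITE + "CCCC" + reverse_complement(I_SCE_I_SITE) + "GG"
print(find_site(toy, I_SCE_I_SITE))
```

An empty list from `find_site` on the full genome sequence would correspond to the site being absent. This treats the chromosome as linear, so a site spanning the junction of a circular sequence would be missed.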

Burland V., Curtis F. P., Kusukawa N., Biotechniques 21, 142 (1996).

Codon usage statistics [

Borodovsky M., McIninch J., Comput. Chem. 17, 123 (1993);

Gribskov M., Devereux J., Burgess R. R., Nucleic Acids Res. 12, 539 (1984);

] were graphically displayed by means of the program Geneplot (DNASTAR). Protein searches were to SWISS-PROT release 34 [

Bairoch A., Apweiler R., ibid. 24, 21 (1996);

]. The Link database is described in A. J. Link, thesis, Harvard University (1994). Signal peptide searches used an unpublished BASIC program written by F.R.B. Predictions for ribosomal binding sites were provided by W. S. Hayes and M. Borodovsky (personal communication).

Riley M., Nucleic Acids Res. 25, 51 (1997).

P. Karp, M. Riley, S. M. Paley, A. Pellegrini-Toole, M. Krummenacker, ibid., p. 43.

Similarity searches were conducted using both the DeCypher II hardware-software system (Time Logic Inc., Incline Village, NV) and the PepPepSearch program of the Darwin suite at Zurich [

10.1126/science.1604319

]. PepPepSearch returns up to 30 hit sequences per query and returns each pairwise alignment and the corresponding PAM scores. For most of the cases only matches with PAM < 200 were used. See

Labedan B., Riley M., Mol. Biol. Evol. 12, 980 (1995).

10.1016/S0022-2836(05)80360-2

Kashiwagi K., Yamaguchi Y., Sakai Y., Kobayashi H., Igarashi K., J. Biol. Chem. 265, 8387 (1990).

Lu Y., Flaherty C., Hendrickson W., ibid. 267, 24848 (1992).

Using the database of 392 known operons that we have localized in the genome sequence, we first predicted operons on the basis of the functional class conservation within genes of an operon. This gives a better prediction (68% positive prediction) than the method of predicting operons on the basis of the distance of genes inside operons versus the distance between operons (59% positive prediction). We predicted 2281 operons by functional class conservation and predicted the remainder with unclassified genes using 50 bp as the distance criterion. The strategy found to give the highest number of positive promoter predictions (∼40% when tested with an independent set of known promoters) involves an initial search with a pair of weight matrices, one for the –10 region and one for the –35 region. Candidate promoters using a low threshold of matches and 15 to 21 bp between –10 and –35 are saved. A subset of best candidates is selected on the basis of a context measure that compares alternative candidates within a given region of 200 bp upstream of each ORF. This includes a weight preference for candidates located closer to the beginning of the gene. The method can find zero, one, or several promoters in a single region. Inside operons, we only saved promoters where regulatory sites were also found. Regulatory sites were searched with a combined weight matrix (when at least three sequences are known) and a string search that allows a fixed number of mismatches for each regulatory site. To avoid overrepresentation of particular sites, we adjusted the number of allowed mismatches such that the number of predicted sites did not exceed 10 times the number of known sites for a given regulatory protein [

Rosenblueth D. A., Thieffry D., Huerta A. M., Salgado H., Collado-Vides J., Comput. Appl. Biosci. 12, 415 (1997)].
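The distance criterion mentioned in the preceding note can be illustrated with a short sketch that groups neighboring genes into candidate operons when the gap between them is at most 50 bp. This is an assumption-laden illustration, not the authors' software: the `Gene` record, the function name `group_by_distance`, the same-strand requirement, and the toy coordinates are all hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_by_distance(genes: List[Gene], max_gap: int = 50) -> List[List[Gene]]:
    """Group adjacent same-strand genes separated by at most max_gap bases.

    The gap is the number of bases strictly between two neighbors, so
    overlapping genes have a negative gap and are always grouped.
    Nested genes and the circular junction are not handled.
    """
    groups: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if groups:
            prev = groups[-1][-1]
            gap = gene.start - prev.end - 1
            if gene.strand == prev.strand and gap <= max_gap:
                groups[-1].append(gene)
                continue
        groups.append([gene])
    return groups


# Toy annotation (hypothetical coordinates).
toy = [
    Gene("a", 1, 100, "+"),
    Gene("b", 120, 300, "+"),   # 19-bp gap to a: same group
    Gene("c", 500, 700, "+"),   # 199-bp gap to b: new group
    Gene("d", 710, 900, "-"),   # strand change: new group
]
for group in group_by_distance(toy):
    print([g.name for g in group])
```

The note reports that functional class conservation gave better predictions than distance alone, so a grouping like this would serve only as the fallback it describes for unclassified genes.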

10.1093/nar/15.3.1281

Grosjean H., Fiers W., Gene 18, 199 (1982);

Ikemura T., Mol. Biol. Evol. 2, 13 (1985).

Médigue C., Rouxel T., Vigier P., Henaut A., Danchin A., J. Mol. Biol. 222, 851 (1991).

The zero reference (0/100, formerly 0/60) of the map was originally defined as the position of the first marker (thr) transferred by E. coli Hfr H, which was used in genetic mapping by interrupted mating, and a convention has arisen of using the first residue of the thrA gene as residue 1. However, this results in placing the regulatory region of the thr operon at the opposite end of the 4.6-Mb sequence from the operon itself. We therefore defined nucleotide 1 as the A residue 189 nucleotides upstream of the initiation codon for thrL, the first gene on the genetic map. We did not detect any feature spanning this point.
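Moving the origin of a circular sequence, as this note describes, is a modular shift of 1-based coordinates. The sketch below shows the arithmetic only. The function name `reorigin` and the example positions are hypothetical, and the offset between the two conventions is not given in the source, so none is assumed here.

```python
GENOME_LENGTH = 4_639_221  # sequence length stated in the abstract


def reorigin(pos, new_origin, length=GENOME_LENGTH):
    """Convert a 1-based position to a numbering where new_origin becomes 1.

    Both pos and new_origin are 1-based positions in the old numbering
    of a circular sequence of the given length.
    """
    return (pos - new_origin) % length + 1


# Toy circle of length 10 (hypothetical): old position 2 becomes position 1,
# so old position 1 wraps around to position 10.
print(reorigin(2, 2, length=10))
print(reorigin(1, 2, length=10))
```

Features that span the chosen origin would be split by such a renumbering, which is why the note's remark that no feature spans this point matters.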

B. J. Brewer, in The Bacterial Chromosome, K. Drlica and M. Riley, Eds. (American Society for Microbiology, Washington, DC, 1990), pp. 61–83.

Wu C.-I., Maeda N., Nature 327, 169 (1987);

Perna N. T., Kocher T. D., J. Mol. Evol. 41, 353 (1995).

10.1093/oxfordjournals.molbev.a025626

Science 272, 745 (1996).

Cardon L. R., Burge C., Schachtel G. A., Blaisdell B. E., Karlin S., Nucleic Acids Res. 21, 3875 (1993);

Blaisdell B. E., Rudd K. E., Matin A., Karlin S., J. Mol. Biol. 229, 833 (1993).

Yoda K., Yasuda H., Xiang X. W., Okazaki T., Nucleic Acids Res. 16, 6531 (1988);

Hiasa H., et al., Gene 84, 9 (1989);

Yoda K., Okazaki T., Mol. Gen. Genet. 227, 1 (1991);

Swart J. R., Griep M. A., J. Biol. Chem. 268, 12970 (1993).

Wang T.-C. V., Chen S.-H., Biochem. Biophys. Res. Commun. 184, 1496 (1992);

ibid. 198, 844 (1994).

The major recombination pathway in E. coli is the RecBCD pathway, so called because of the central involvement of the enzyme encoded by the recBCD genes. For a review of RecBCD-mediated recombination, see

Stahl F., Myers R., J. Hered. 86, 327 (1995);

see also (38). For a review of recombination-deficient variants of Chi, see

Schultz D. W., Swindle J., Smith G. R., J. Mol. Biol. 146, 275 (1981).

Kuzminov A., Mol. Microbiol. 16, 373 (1995).

Burge C., Campbell A. M., Karlin S., Proc. Natl. Acad. Sci. U.S.A. 89, 1358 (1992);

McClelland M., Bhagwat A. S., Nature 355, 595 (1992);

Bhagwat A. S., McClelland M., Nucleic Acids Res. 20, 1663 (1992);

R. Merkl, M. Kroger, P. Rice, H. J. Fritz, ibid., p. 1657; S. Karlin and L. R. Cardon, Annu. Rev. Microbiol. 48, 619 (1994).

Médigue C., Viari A., Hénaut A., Danchin A., Mol. Microbiol. 5, 2629 (1991).

Burlingame R. P., Wyman L., Chapman P. J., J. Bacteriol. 168, 55 (1986);

Bugg T. D. H., Biochim. Biophys. Acta 1202, 258 (1993);

Spence E., Kawamukai M., Sanvoisin J., Braven H., Bugg T., J. Bacteriol. 178, 5249 (1996).

Tan H. M., Tang H. Y., Joannou C. L., Abdel-Wahab N. H., Manson J. R., Gene 130, 33 (1993).

R. M. Macnab, in (2), vol. 2, pp. 123–145;

Homma M., DeRosier D. J., Macnab R. M., J. Mol. Biol. 213, 819 (1990);

10.1128/jb.176.8.2272-2281.1994

For a discussion of mviM and mviN, see

Kutsukake K., Okada T., Yokoseki T., Iino T., Gene 143, 49 (1994).

For a discussion of ATT start in infC, see

Sacerdot C., et al., EMBO J. 1, 311 (1982);

for a discussion of CTG start in htgA, see

Missiakas D., Georgopoulos C., Raina S., J. Bacteriol. 175, 2613 (1993).

Daniels D. L., Sanger F., Coulson A. R., Cold Spring Harbor Symp. Quant. Biol. 47, 1009 (1983);

10.1016/0022-2836(82)90546-0

A number of bacterial proteins have been implicated in mediating the invasion of host cells by pathogens. Attaching and effacing proteins are involved in eliciting an extensive rearrangement of host cell actin by enteropathogenic E. coli strains, whereas invasins are bacterial surface proteins that provoke the endocytic uptake of Yersinia and Salmonella spp. by host cells. For an overview of bacterial pathogenesis, including virulence factors, see A. A. Salyers and D. D. Whitt, Bacterial Pathogenesis: A Molecular Approach (ASM Press, Washington, DC, 1994).

10.1128/mr.57.4.862-952.1993

___ and B. Labedan, in (2), vol. 2, pp. 2118–2202.

Relations among these eubacteria are estimated by a rRNA phylogeny [

Olsen G. J., Woese C. R., Overbeek R., J. Bacteriol. 176, 1 (1994);

]. A previous estimate of 1128 Haemophilus influenzae orthologs among 75% of the complete E. coli genome [

Tatusov R. L., et al., Curr. Biol. 6, 279 (1996);

] is based on less restrictive criteria and includes sequences with as little as 18% identity.

Abdullah K. M., Lo R. Y., Mellors A., J. Bacteriol. 173, 5597 (1991).

S. Ohno, Evolution by Gene Duplication (Springer-Verlag, Berlin, 1970).

J. D. Gralla and J. Collado-Vides, in (2), vol. 1, pp. 1232–1244.

S. Bachellier, E. Gilson, M. Hofnung, C. W. Hill, in (2), vol. 2, pp. 2012–2040.

T. M. Hill, in (2), vol. 2, pp. 1602–1612.

François V., Louarn J., Louarn J.-M., Mol. Microbiol. 3, 995 (1989).

Nakata A. M., Amemura M., Makino K., J. Bacteriol. 171, 3553 (1989).

R. C. Deonier, in (2), vol. 2, pp. 2000–2011.

Matsutani S., Ohtsubo E., Gene 127, 111 (1993).

For a review of K-12 prophage, see A. M. Campbell, in (2), vol. 2, pp. 2041–2046. CP4-57 is described in

Retallack D. M., Johnson L. L., Friedman D. I., J. Bacteriol. 176, 2082 (1994);

J. E. Kirby, J. E. Trempy, S. Gottesman, ibid., p. 2068.

P22 [

Lindsey D. F., Martinez C., Walker J. R., J. Bacteriol. 174, 3834 (1992);

] and a phage from a clinical isolate [

Lim D., Mol. Microbiol. 6, 3531 (1992);

] also integrate into thrW.

Van Vliet F., Boyen A., Glansdorff N., Ann. Inst. Pasteur Microbiol. 139, 493 (1988).

E. Kofoid and J. Roth, personal communication.

This is Laboratory of Genetics paper 3487. We thank the entire E. coli community for their support, encouragement, and sharing of data, and especially D. L. Daniels and N. Peterson, who were present at the creation. We also thank R. Straussburg and M. Guyer, our program administrators; R. R. Burgess and M. Sussman for critical reading of the manuscript; M. Borodovsky and W. S. Hayes for application of a new version of the GeneMark program to the analysis of the sequence; K. Rudd for his Ecoseq7 melds of GenBank data; J. Mahillon for providing I–Sce I strains; J. Roth and E. Kofoid for unpublished Salmonella data; the Japanese group under H. Mori and T. Horiuchi for cooperative competition; G. Pósfai and W. Szybalski for the popout strains; S. Baldwin, C. Allex, N. Manola, G. Bouriakov, and J. Schroeder of DNASTAR for extraordinary software; A. Huerta, H. Salgado, and D. Thieffry for help with promoter, operon, and regulatory site identification; T. Thiesen for Postscript illustrations; H. Kijenski, G. Peyrot, P. Soni, G. Diarra, E. Grotbeck, T. Forsythe, M. Maguire, M. Federle, S. Subramanian, and K. Kadner for excellent technical work; and 169 University of Wisconsin undergraduates who participated over the last decade. Supported by NIH grants P01 HG01428 (from the Human Genome Project) and S10 RR10379 (for ABI machines from the National Center for Research Resources–Biomedical Research Support Shared Instrumentation Grant). We thank IBM for the gift of workstations, the State of Wisconsin for remodeling support, and especially SmithKline Beecham Pharmaceuticals and Genome Therapeutics Corp. for financial support of the annotation of this sequence. N.P. is an NSF fellow in molecular evolution.