The Complete Genome Sequence of Escherichia coli K-12
Abstract
The 4,639,221-base pair sequence of
Keywords
References
. Escherichia coli has been the subject of extensive monographs, the most recent of which is (2).
Escherichia coli and Salmonella: Cellular and Molecular Biology, F. C. Neidhardt et al., Eds. (ASM Press, Washington, DC, 1996).
The publicly available complete genome sequences are those of Haemophilus influenzae Rd [
], Mycoplasma genitalium [
Fraser C. M., et al., ibid. 270, 397 (1995);
], Methanococcus jannaschii [C. J. Bult et al., ibid. 273, 1058 (1996)], Mycoplasma pneumoniae [
], Synechocystis sp. strain PCC6803 [
], and Saccharomyces cerevisiae [
; D. J. Lockhart et al., Nature Biotechnol. 14, 1675 (1996).
F. C. Neidhardt, in (2), vol. 2, pp. 1–3.
B. Bachmann, in (2), vol. 2, pp. 2460–2488.
Lawther R. P., et al., ibid. 149, 294 (1982).
Fujita N., Mori H., Yura T., Ishihama A., ibid. 22, 1637 (1994);
; H. Aiba et al., ibid., p. 363; T. Itoh et al., ibid., p. 379.
Six segments of the genome were sequenced using radioactive chemistry (14) [
; F. R. Blattner, V. Burland, G. Plunkett III, H. J. Sofia, D. L. Daniels, ibid., p. 5408; H. J. Sofia, V. Burland, D. L. Daniels, G. Plunkett III, F. R. Blattner, ibid. 22, 2576 (1994); V. Burland, G. Plunkett III, H. J. Sofia, D. L. Daniels, F. R. Blattner, ibid. 23, 2105 (1995)]. We determined experimentally that deoxyinosine triphosphate (dITP) is the most effective analog for resolving G-C compressions, although it also causes premature termination. With radioactive sequencing, a dITP sequence lane must be run in addition to, rather than in place of, a deoxyguanosine triphosphate (dGTP) run. For efficiency, in the areas of E. coli that we sequenced radioactively, tiling software was used to select a minimal set of M13 clones for resequencing with dITP after the bulk of the assembly had been completed with dGTP. On the other hand, because prematurely terminated chains are not labeled by the fluorophore, with dye-terminator fluorescent sequencing dITP can substitute totally for dGTP and can be used for all routine data collection.
D. L. Daniels, in The Bacterial Chromosome, K. Drlica and M. Riley, Eds. (American Society for Microbiology, Washington, DC, 1990), pp. 43–51. It was often necessary to resequence overlapping regions between adjacent clones, and screening to remove lambda vector sequences before sequencing was costly. Occasionally we found deleted, mismapped, or chimeric lambda clones that created unexpected gaps in genome coverage.
Although the 1-μg yield of popout plasmid [
] was low for early shotgun protocols, the assemblies were successful when supplemented with lambda clone and long-range PCR data. The main problem with extending this approach was the need to specifically engineer each popout plasmid by insertional recombination into the host.
I–Sce I is a site-specific intron-encoded homing endonuclease from yeast [
], whose 18-bp nonpalindromic recognition site is absent from E. coli (C. A. Bloch and C. K. Rode, unpublished data). Single I–Sce I sites were introduced into MG1655 on a transposable element to produce a mapped collection of strains, each with a unique I–Sce I site [
Bloch C. A., Rode C. K., Obreque V. H., Mahillon J., Biochem. Biophys. Res. Commun. 223, 104 (1996);
]. P1 transduction was used to combine sites in pairs, permitting isolation of I–Sce I fragments as single bands by pulsed-field gel electrophoresis. Sequencing confirmed the expected nine-base overlap between adjacent fragments. Although the background contamination for entire I–Sce I fragment shotguns ranged from 15 to 30%, we occasionally observed individual preparative gels that seemed to have <5% background, as assessed from gel images. We therefore suspect that improvements in gel handling and electrophoretic conditions could improve the overall quality of the fragment preparations.
Codon usage statistics [
] were graphically displayed by means of the program Geneplot (DNASTAR). Protein searches were run against SWISS-PROT release 34 [
Bairoch A., Apweiler R., ibid. 24, 21 (1996);
]. The Link database is described in A. J. Link, thesis, Harvard University (1994). Signal peptide searches used an unpublished BASIC program written by F.R.B. Predictions for ribosomal binding sites were provided by W. S. Hayes and M. Borodovsky (personal communication).
P. Karp, M. Riley, S. M. Paley, A. Pellegrini-Toole, M. Krummenacker, ibid., p. 43.
Similarity searches were conducted using both the DeCypher II hardware-software system (Time Logic Inc., Incline Village, NV) and the PepPepSearch program of the Darwin suite at Zurich [
]. PepPepSearch returns up to 30 hit sequences per query, together with each pairwise alignment and the corresponding PAM scores. In most cases, only matches with PAM < 200 were used. See
Labedan B., Riley M., Mol. Biol. Evol. 12, 980 (1995).
Lu Y., Flaherty C., Hendrickson W., ibid. 267, 24848 (1992).
Using the database of 392 known operons that we have localized in the genome sequence, we first predicted operons on the basis of the functional class conservation within genes of an operon. This gives a better prediction (68% positive prediction) than the method of predicting operons on the basis of the distance of genes inside operons versus the distance between operons (59% positive prediction). We predicted 2281 operons by functional class conservation and predicted the remainder, with unclassified genes, using 50 bp as the distance criterion. The strategy found to give the highest number of positive promoter predictions (∼40% when tested with an independent set of known promoters) involves an initial search with a pair of weight matrices, one for the –10 region and one for the –35 region. Candidate promoters using a low threshold of matches and 15 to 21 bp between –10 and –35 are saved. A subset of best candidates is selected on the basis of a context measure that compares alternative candidates within a given region of 200 bp upstream of each ORF. This includes a weight preference for candidates located closer to the beginning of the gene. The method can find zero, one, or several promoters in a single region. Inside operons, we only saved promoters where regulatory sites were also found. Regulatory sites were searched with a combined weight matrix (when at least three sequences are known) and a string search that allows a fixed number of mismatches for each regulatory site. To avoid overrepresentation of particular sites, we adjusted the number of allowed mismatches such that the number of predicted sites did not exceed 10 times the number of known sites for a given regulatory protein [
Rosenblueth D. A., Thieffry D., Huerta A. M., Salgado H., Collado-Vides J., Comput. Appl. Biosci. 12, 415 (1996)].
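Three of the steps in the preceding note (the 50-bp distance criterion, the string search with a fixed number of mismatches, and the 15 to 21 bp spacing between –35 and –10) can be illustrated with a minimal Python sketch. It is not the authors' software. All function names are hypothetical, the same-strand requirement for grouping genes is an assumption, and plain consensus hexamers (the widely cited TTGACA and TATAAT) stand in for the weight matrices described above. The spacing is taken here to mean the gap between the end of the –35 box and the start of the –10 box, which is also an assumption.

```python
from typing import List, Tuple

# (name, start, end, strand) with 1-based inclusive coordinates; hypothetical layout
Gene = Tuple[str, int, int, str]


def group_by_distance(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group neighbouring genes whose intergenic gap is at most max_gap bp.

    Requiring the same strand is an assumption, not stated in the source.
    """
    operons: List[List[str]] = []
    prev = None
    for gene in sorted(genes, key=lambda g: g[1]):
        name, start, _end, strand = gene
        if prev is not None and strand == prev[3] and start - prev[2] - 1 <= max_gap:
            operons[-1].append(name)
        else:
            operons.append([name])
        prev = gene
    return operons


def find_sites(seq: str, site: str, max_mismatches: int) -> List[int]:
    """Return 0-based offsets where site matches seq with <= max_mismatches."""
    hits = []
    for i in range(len(seq) - len(site) + 1):
        window = seq[i:i + len(site)]
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def pair_boxes(minus35: List[int], minus10: List[int], box_len: int = 6,
               min_spacer: int = 15, max_spacer: int = 21) -> List[Tuple[int, int]]:
    """Keep (-35, -10) hit pairs separated by a spacer of 15 to 21 bp."""
    pairs = []
    for a in minus35:
        for b in minus10:
            spacer = b - (a + box_len)
            if min_spacer <= spacer <= max_spacer:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Toy gene list, invented for illustration only.
    toy_genes = [("a", 1, 100, "+"), ("b", 120, 300, "+"),
                 ("c", 400, 500, "+"), ("d", 510, 600, "-")]
    print(group_by_distance(toy_genes))   # [['a', 'b'], ['c'], ['d']]

    # Toy sequence: -35 consensus, 17-bp spacer, -10 consensus.
    toy_seq = "TTGACA" + "G" * 17 + "TATAAT"
    m35 = find_sites(toy_seq, "TTGACA", 0)
    m10 = find_sites(toy_seq, "TATAAT", 0)
    print(pair_boxes(m35, m10))           # [(0, 23)]
```

Raising `max_mismatches` widens the set of predicted sites, which is the parameter the note describes tuning so that predictions do not exceed 10 times the number of known sites.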
Ikemura T., Mol. Biol. Evol. 2, 13 (1985).
The zero reference (0/100, formerly 0/60) of the map was originally defined as the position of the first marker (thr) transferred by E. coli Hfr H, which was used in genetic mapping by interrupted mating, and a convention has arisen of using the first residue of the thrA gene as residue 1. However, this results in placing the regulatory region of the thr operon at the opposite end of the 4.6-Mb sequence from the operon itself. We therefore defined nucleotide 1 as the A residue 189 nucleotides upstream of the initiation codon for thrL, the first gene on the genetic map. We did not detect any feature spanning this point.
B. J. Brewer, in The Bacterial Chromosome, K. Drlica and M. Riley, Eds. (American Society for Microbiology, Washington, DC, 1990), pp. 61–83.
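Moving the origin of a circular map is modular arithmetic, and a short Python sketch makes the bookkeeping explicit. The function name `rebase` and the example origin are hypothetical. The genome length is the 4,639,221 bp stated in the abstract, and the value 190 for the first base of the thrL initiation codon follows directly from the 189-nucleotide offset given in the note.

```python
GENOME_LEN = 4_639_221  # base pairs, as stated for the complete sequence


def rebase(pos: int, new_origin: int, genome_len: int = GENOME_LEN) -> int:
    """Convert a 1-based position on a circular genome to a numbering
    in which `new_origin` (given in the old numbering) becomes residue 1."""
    return (pos - new_origin) % genome_len + 1


# Nucleotide 1 lies 189 nt upstream of the thrL initiation codon,
# so the first base of that codon is residue 1 + 189.
THRL_START = 1 + 189

if __name__ == "__main__":
    print(THRL_START)           # 190
    print(rebase(100, 100))     # 1
    print(rebase(99, 100))      # 4639221 (wraps around the circle)
```

The residue just before the new origin maps to the last coordinate. A feature spanning the origin would therefore be split across both ends of the linear numbering, which is why it matters that no feature spans this point.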
Cardon L. R., Burge C., Schachtel G. A., Blaisdell B. E., Karlin S., Nucleic Acids Res. 21, 3875 (1993);
The major recombination pathway in E. coli is the RecBCD pathway, so called because of the central involvement of the enzyme encoded by the recBCD genes. For a review of RecBCD-mediated recombination, see
; see also (38). For a review of recombination-deficient variants of Chi, see
McClelland M., Bhagwat A. S., Nature 355, 595 (1992);
; R. Merkl, M. Kroger, P. Rice, H. J. Fritz, ibid., p. 1657; S. Karlin and L. R. Cardon, Annu. Rev. Microbiol. 48, 619 (1994).
R. M. Macnab, in (2), vol. 2, pp. 123–145;
; for a discussion of mviM and mviN, see
For a discussion of ATT start in infC, see
; for a discussion of CTG start in htgA, see
A number of bacterial proteins have been implicated in mediating the invasion of host cells by pathogens. Attaching and effacing proteins are involved in eliciting an extensive rearrangement of host cell actin by enteropathogenic E. coli strains, whereas invasins are bacterial surface proteins that provoke the endocytic uptake of Yersinia and Salmonella spp. by host cells. For an overview of bacterial pathogenesis, including virulence factors, see A. A. Salyers and D. D. Whitt, Bacterial Pathogenesis: A Molecular Approach (ASM Press, Washington, DC, 1994).
___ and B. Labedan, in (2), vol. 2, pp. 2118–2202.
Relations among these eubacteria are estimated by a rRNA phylogeny [
]. A previous estimate of 1128 Haemophilus influenzae orthologs among 75% of the complete E. coli genome [
] is based on less restrictive criteria and includes sequences with as little as 18% identity.
J. D. Gralla and J. Collado-Vides, in (2), vol. 1, pp. 1232–1244.
S. Bachellier, E. Gilson, M. Hofnung, C. W. Hill, in (2), vol. 2, pp. 2012–2040.
T. M. Hill, in (2), vol. 2, pp. 1602–1612.
R. C. Deonier, in (2), vol. 2, pp. 2000–2011.
For a review of K-12 prophage, see A. M. Campbell, in (2), vol. 2, pp. 2041–2046. CP4-57 is described in
; J. E. Kirby, J. E. Trempy, S. Gottesman, ibid., p. 2068.
P22 [
] and a phage from a clinical isolate [
] also integrate into thrW.
E. Kofoid and J. Roth, personal communication.
This is Laboratory of Genetics paper 3487. We thank the entire E. coli community for their support, encouragement, and sharing of data, and especially D. L. Daniels and N. Peterson, who were present at the creation. We also thank R. Straussburg and M. Guyer, our program administrators; R. R. Burgess and M. Sussman for critical reading of the manuscript; M. Borodovsky and W. S. Hayes for application of a new version of the GeneMark program to the analysis of the sequence; K. Rudd for his Ecoseq7 melds of GenBank data; J. Mahillon for providing I–Sce I strains; J. Roth and E. Kofoid for unpublished Salmonella data; the Japanese group under H. Mori and T. Horiuchi for cooperative competition; G. Pósfai and W. Szybalski for the popout strains; S. Baldwin, C. Allex, N. Manola, G. Bouriakov, and J. Schroeder of DNASTAR for extraordinary software; A. Huerta, H. Salgado, and D. Thieffry for help with promoter, operon, and regulatory site identification; T. Thiesen for Postscript illustrations; H. Kijenski, G. Peyrot, P. Soni, G. Diarra, E. Grotbeck, T. Forsythe, M. Maguire, M. Federle, S. Subramanian, and K. Kadner for excellent technical work; and 169 University of Wisconsin undergraduates who participated over the last decade. Supported by NIH grants P01 HG01428 (from the Human Genome Project) and S10 RR10379 (for ABI machines, from the National Center for Research Resources–Biomedical Research Support Shared Instrumentation Grant). We thank IBM for the gift of workstations, the State of Wisconsin for remodeling support, and especially SmithKline Beecham Pharmaceuticals and Genome Therapeutics Corp. for financial support of the annotation of this sequence. N.P. is an NSF fellow in molecular evolution.