Complete sequence of a pathogenic isolate of the bacterium Streptococcus pneumoniae
Abstract
The 2,160,837-base-pair genome sequence of an isolate of
Keywords
Streptococcus pneumoniae; genome sequence; coding regions; extracellular enzymes; signal peptide motifs; surface proteins; vaccine candidates; strain differences; virulence; antigenicity.
References
The TIGR4 isolate was previously referred to as JNR.7/87, the label of the clinical isolate [ ]; as KNR.7/87 [A. de Saizieu et al., J. Bacteriol. 182, 4696 (2000); R. Hakenbeck et al., Infect. Immun. 69, 2477 (2001)]; and as N4 [T. M. Wizemann et al., Infect. Immun. 69, 1593 (2001)]. Midway through the sequencing project, it became evident that one particular bacterial stock was contaminated with S. gordonii, because reads from libraries made with DNA derived from this stock were composed entirely of non–S. pneumoniae sequences (assessed by using all available S. pneumoniae and S. gordonii sequences in GenBank) and would not assemble with the S. pneumoniae DNA. Because all aspects of the sequencing project are tracked through a relational database [R. D. Fleischmann et al., Science 269, 496 (1995)], the problem was addressed by identifying and removing all the reads from the libraries in question from the project (S. gordonii sequences are available on TIGR's Web site, www.tigr.org/tdb/s_gordonii.shtml). The S. pneumoniae single-colony isolate that was grown for use in all subsequent libraries was named TIGR4.
Cloning, sequencing, and assembly were as described [W. C. Nierman et al., Proc. Natl. Acad. Sci. U.S.A. 98, 4136 (2001)]. Four small-insert (∼1.5 kb) shotgun libraries were constructed in pUC-derived vectors after random mechanical shearing (nebulization) of genomic DNA, and three large-insert (∼18 kb) shotgun libraries were constructed in λ-DASH II vectors (Stratagene) after partial Sau3A digestion of genomic DNA. Sequencing of the small-insert libraries was achieved at a success rate of 66% with an average read length of 518 bp. The first library constructed was nonrandom, but improvement of the construction methods provided subsequent random libraries. In contrast, none of the large-insert libraries appeared to be completely random. Sequencing of these yielded the following average read lengths and success rates per library: first, 366 nucleotides (nt) at 26%; second, 620 nt at 52%; and third, 597 nt at 66%. In the late stages of closure, the newly engineered TIGR vector pHOS2 (a pBR derivative) was used to construct a new large-insert (∼9 kb) library. Sequencing of this library yielded 508 nt at 48.5% success; these are low values, but the library was substantially more random than the lambda libraries. In total, 40,839 small-insert and 3449 large-insert end sequences were jointly assembled into 390 contigs larger than 1.5 kb (with 220 sequencing gaps and 170 physical gaps) using TIGR Assembler [ ]. The coverage criteria were that every position required at least double-clone coverage (or sequence from a PCR product amplified from genomic DNA) and either sequence from both strands or sequence obtained with two different sequencing chemistries. The sequence was edited manually with the TIGR Editor, and additional PCR [ ] and sequencing reactions were performed to close gaps, improve coverage, and resolve sequence ambiguities. Particularly difficult regions, including SP1772, which contains 540 copies of a 24-bp imperfect repeat, were covered by transposon-assisted sequencing (New England Biolabs pGPS Transposon Kit) and mapping of transposon insertions before assembly.
Open reading frames (ORFs) likely to encode proteins were predicted by Glimmer [ ]. This program, based on interpolated Markov models, was trained with ORFs larger than 600 bp from the genomic sequence, as well as with the S. pneumoniae genes available in GenBank. All predicted proteins larger than 30 amino acids were searched against a nonredundant protein database as previously described [ ]. Frameshifts and point mutations were detected and corrected where appropriate. Remaining frameshifts and point mutations are considered to be authentic and were annotated as “authentic frameshift” or “authentic point mutation.” Protein membrane-spanning domains were identified by TopPred [M. G. Claros, G. von Heijne, Comput. Appl. Biosci. 10, 685 (1994); ]. The 5' regions of each ORF were inspected to define initiation codons using homologies, position of ribosomal binding sites, and transcriptional terminators. Two sets of hidden Markov models were used to determine ORF membership in families and superfamilies: Pfam v5.5 [A. Bateman et al., Nucleic Acids Res. 28, 263 (2000)] and TIGRFAMs 1.0 [D. H. Haft et al., Nucleic Acids Res. 29, 41 (2001)]. Pfam v5.5 hidden Markov models were also used, with a constraint of a minimum of two hits, to find repeated domains within proteins and mask them. Domain-based paralogous families were then built by performing all-versus-all searches on the remaining protein sequences using a modified version of a previously described method [W. C. Nierman et al., Proc. Natl. Acad. Sci. U.S.A. 98, 4136 (2001)]. The extent of potential lineage-specific gene duplications in this genome was estimated by identification of ORFs that are more similar to other ORFs within the TIGR4 genome than to ORFs from other complete genomes, including those of plasmids, organelles, and phages. All ORFs were searched with FASTA3 against all ORFs from the complete genomes, and matches with a FASTA p value of 10⁻⁵ were considered significant.
Supplementary Web material is available on Science Online at www.sciencemag.org/cgi/content/full/293/5529/498/DC1.
C. Fraser et al., Science 270, 397 (1995); F. Kunst et al., Nature 390, 249 (1997); L. Banerjei, personal communication; S. Gill, personal communication.
B. Martin et al., Nucleic Acids Res. 20, 3479 (1992).
C. M. Fraser et al., Science 281, 375 (1998).
J. N. Weiser, in Streptococcus pneumoniae: Molecular Biology and Mechanisms of Disease, A. Tomasz, Ed. (Mary Ann Liebert, Larchmont, NY, 2000), pp. 245–252.
Iterative DNA motifs, including homopolymeric tracts, were searched in the TIGR4 genome sequence using the REPEATS program [ ]. The minimum length of homopolymeric tracts was set at eight for A and T and at six for G and C; the minimum for di- and trinucleotide repeats was four tandem copies, and for tetra-, penta-, and hexanucleotide repeats, three copies. Repeats of heptanucleotides and longer units were not found in three or more copies, except for the imperfect repeats in SP1772. The ratio of the observed frequency of homopolymeric tracts to their expected frequency was determined by means of Markov chain analysis as described [N. J. Saunders et al., Mol. Microbiol. 37, 207 (2000)]. It revealed that G or C tracts of 8 bp and A or T tracts of 10 and 11 bp are slightly overrepresented.
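The thresholds in this note can be illustrated with a minimal Python sketch. It is not the REPEATS program: it uses regular expressions as a stand-in, detects only perfect repeats, and the function names `find_homopolymeric_tracts` and `find_tandem_repeats` are hypothetical.

```python
import re

# Minimum tract lengths quoted in the note: 8 for A and T, 6 for G and C.
MIN_TRACT = {"A": 8, "T": 8, "G": 6, "C": 6}

def find_homopolymeric_tracts(seq):
    """Return (start, base, length) for each maximal single-base run
    that meets the per-base minimum length (0-based start)."""
    hits = []
    for m in re.finditer(r"A+|C+|G+|T+", seq.upper()):
        base, length = m.group()[0], len(m.group())
        if length >= MIN_TRACT[base]:
            hits.append((m.start(), base, length))
    return hits

def find_tandem_repeats(seq, unit_len, min_copies):
    """Return (start, unit, copies) for perfect tandem repeats of a unit
    of unit_len bases present in at least min_copies copies.
    Assumption: imperfect repeats are ignored, and matches are
    non-overlapping as reported by the regex engine."""
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:
            continue  # a repeated single base is a homopolymeric tract
        hits.append((m.start(), unit, len(m.group()) // unit_len))
    return hits
```

With the note's settings, dinucleotide and trinucleotide repeats would be searched as `find_tandem_repeats(seq, 2, 4)` and `find_tandem_repeats(seq, 3, 4)`, and tetra- to hexanucleotide repeats with `min_copies=3`. The sketch does not attempt the observed-to-expected Markov chain analysis.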
Putative choline-binding motifs [J. L. Garcia, A. R. Sanchez-Beato, F. J. Medrano, R. Lopez, in Streptococcus pneumoniae: Molecular Biology and Mechanisms of Disease, A. Tomasz, Ed. (Mary Ann Liebert, Larchmont, NY, 2000), pp. 231–244] were identified using Pfam hidden Markov model (HMM) PF01473 [A. Bateman et al., Nucleic Acids Res. 28, 263 (2000)]. LPxTG-type Gram-positive anchor regions [ ] were detected by Pfam HMM PF00746 and by a new HMM built with HMMER 2.1.1 [ ] from a new curated alignment of the surrounding region in S. pneumoniae. Candidate lipoprotein signal peptides [ ] were flagged by NH₂-terminal exact matches to the pattern {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C (35), culled of hypothetical proteins and cytosolic proteins, aligned manually, and used to generate a new HMM. Proteins matching both the HMM and the regular expression are predicted lipoproteins. Putative signal peptides were identified with SignalP [ ].
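The lipoprotein pattern in this note is written in PROSITE-style syntax, where `{DERK}` means any residue except D, E, R, or K and `(6)` is a repeat count. A minimal Python sketch of the same pattern as a regular expression follows. The 40-residue NH₂-terminal window and the function name `lipobox_cysteine` are assumptions made for illustration; the note does not state how far from the NH₂ terminus a match may lie.

```python
import re

# PROSITE-style pattern from the note, rewritten as a Python regex:
# {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C
LIPO_PATTERN = re.compile(r"[^DERK]{6}[LIVMFWSTAG]{2}[LIVMFYSTAGCQ][AGS]C")

# Assumption: only the first 40 residues are treated as "NH2-terminal".
N_TERM_WINDOW = 40

def lipobox_cysteine(protein_seq, window=N_TERM_WINDOW):
    """Return the 0-based index of the conserved Cys if the pattern
    matches within the NH2-terminal window, otherwise None."""
    m = LIPO_PATTERN.search(protein_seq.upper()[:window])
    return m.end() - 1 if m else None

# Made-up example sequence, not a real S. pneumoniae protein.
print(lipobox_cysteine("MKKLLALSLAGLALAACSSG"))
```

This covers only the regular-expression step. The note's prediction also requires a match to an HMM, which is not reproduced here.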
The NH₂-terminal regions of all proteins predicted to have signal sequences were collected for clustering and alignment with ClustalW and were scrutinized. An HMM based on an edited alignment of 40-residue segments around the (Y/F)SIRK motif found several hundred hits in a nonredundant amino acid database. A more general motif based on the larger family of YSIRK proteins is (Y/F)(S/A)(I/L)(R/K)(R/K)xxxGxxS (35).
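The general motif (Y/F)(S/A)(I/L)(R/K)(R/K)xxxGxxS can be written directly as a regular expression, with each x standing for any residue. The Python sketch below is only an illustration of the motif as written; the function name `find_ysirk_motifs` is hypothetical, and the searches described in the note used an HMM rather than a regular expression.

```python
import re

# (Y/F)(S/A)(I/L)(R/K)(R/K)xxxGxxS, where x is any residue.
YSIRK_GENERAL = re.compile(r"[YF][SA][IL][RK][RK].{3}G.{2}S")

def find_ysirk_motifs(protein_seq):
    """Return (start, matched_text) for each non-overlapping match
    of the general motif (0-based start)."""
    return [(m.start(), m.group())
            for m in YSIRK_GENERAL.finditer(protein_seq.upper())]

# Made-up example sequence, not taken from the genome.
print(find_ysirk_motifs("MKNKKYSIRKFSVGVASVLIGT"))
```

A regular expression accepts or rejects a sequence outright, whereas an HMM scores near matches, so the two approaches will not flag identical sets of proteins.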
Single-letter abbreviations for the amino acid residues are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.
This method is used to identify genomic differences between the TIGR4 strain and strains R6 and D39. All the predicted genes from the TIGR4 strain were amplified by PCR and arrayed on glass microscope slides as previously described [ ]. Genomic DNA for comparative genome hybridization studies was labeled according to protocols provided by J. DeRisi (www.microarrays.org/pdfs/GenomicDNALabel_B.pdf), except that genomic DNA was not digested or sheared before labeling. Arrays were scanned with a GenePix 4000B scanner from Axon (Union City, CA), and individual hybridization signals were quantitated with TIGR SPOTFINDER [P. Hegde et al., Biotechniques 29, 548 (2000)].
Regions of atypical nucleotide composition were identified by χ² analysis: the distribution of all 64 trinucleotides (trimers) was computed for the complete genome in all six reading frames, followed by the trimer distribution in 2000-bp windows. Windows overlapped by 1500 bp. For each window, the χ² statistic on the difference between its trimer content and that of the whole genome was computed. The most atypical regions, with a score of 600 and above, were considered in this analysis.
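The windowed χ² calculation can be sketched in Python as below. The details are assumptions: trimers are counted at every offset on both strands to cover all six frames, the expected count for each trimer is the window total multiplied by the genome-wide trimer frequency, and the names `trimer_counts` and `chi2_windows` are hypothetical. The note does not give the exact formula used, so scores from this sketch may not be on the same scale as the threshold of 600.

```python
from collections import Counter
from itertools import product

TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def trimer_counts(seq):
    """Count overlapping trimers on both strands (all six frames)."""
    counts = Counter()
    for strand in (seq, revcomp(seq)):
        for i in range(len(strand) - 2):
            counts[strand[i:i + 3]] += 1
    return counts

def chi2_windows(genome, window=2000, step=500):
    """Return (start, chi2) per window. step=500 gives 2000-bp windows
    that overlap by 1500 bp, as in the note."""
    genome = genome.upper()
    g = trimer_counts(genome)
    g_total = sum(g[t] for t in TRIMERS)  # ignores trimers containing N
    freqs = {t: g[t] / g_total for t in TRIMERS}
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        w = trimer_counts(genome[start:start + window])
        w_total = sum(w[t] for t in TRIMERS)
        if w_total == 0:
            continue  # window holds no unambiguous trimers
        chi2 = sum((w[t] - w_total * freqs[t]) ** 2 / (w_total * freqs[t])
                   for t in TRIMERS if freqs[t] > 0)
        scores.append((start, chi2))
    return scores
```

Windows whose score exceeds a chosen cutoff would then be reported as atypical, for example `[s for s in chi2_windows(genome) if s[1] >= cutoff]`, where `cutoff` needs to be calibrated to whichever form of the statistic is used.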
We thank M. Heaney, J. Scott, M. Holmes, V. Sapiro, B. Lee, and B. Vincent for software and database support at TIGR; M. Ermolaeva and M. Pertea for specific computer analyses; the TIGR faculty and sequencing core for expert advice and assistance; I. Aaberge (National Institute of Public Health, Oslo, Norway) for providing the initial clinical isolate labeled JNR.7/87; and G. Zysk and A. Polissi for sharing specific sequence data not deposited in GenBank. Supported in part by the National Institute of Allergy and Infectious Diseases (grant R01 AI40645-01A1) and the Merck Genome Research Institute (grant MGRI72).