Complete Genome Sequence of a Virulent Isolate of Streptococcus pneumoniae

American Association for the Advancement of Science (AAAS) - Volume 293, Issue 5529 - Pages 498-506 - 2001
Hervé Tettelin1, William Nelson1, Ian T. Paulsen2,1, Jonathan A. Eisen2,1, Timothy D. Read1, Scott N. Peterson3,1, John F. Heidelberg1, Robert T. DeBoy1, Daniel H. Haft1, Robert J. Dodson1, A. Scott Durkin1, Michelle Gwinn1, James F. Kolonay1, Jeremy Peterson1, Lowell Umayam1, Owen White1, Steven L. Salzberg4,1, Matthew R. Lewis1, Diana Radune1, Erik Holtzapple1, Hoda Khouri1, Alex M. Wolf1, Terry Utterback1, Charles Hansen1, Lisa McDonald1, Tamara Feldblyum1, Samuel V. Angiuoli1, Timothy A. Dickinson1, Erin K. Hickey1, Ingeborg Holt1, Brendan Loftus1, Fan Yang1, Hamilton O. Smith1, J. Craig Venter1, Brian Dougherty5, Donald A. Morrison6, Susan K. Hollingshead7, Claire M. Fraser3,1
1The Institute for Genomic Research (TIGR), 9712 Medical Center Drive, Rockville, MD 20850, USA
2Johns Hopkins University, Charles and 34th Streets, Baltimore, MD 21218, USA.
3George Washington University Medical Center, 2300 Eye Street, NW, Washington, DC 20037, USA
4Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA
5Bristol-Myers Squibb PRI, 5 Research Parkway, Wallingford, CT 06492, USA
6University of Illinois at Chicago, 900 South Ashland Avenue, Chicago, IL 60607, USA.
7University of Alabama at Birmingham, 845 19th Street South, Birmingham, AL 35294, USA.

Abstract

The 2,160,837-base pair genome sequence of an isolate of Streptococcus pneumoniae, a Gram-positive pathogen that causes pneumonia, bacteremia, meningitis, and otitis media, contains 2236 predicted coding regions; of these, 1440 (64%) were assigned a biological role. Approximately 5% of the genome is composed of insertion sequences that may contribute to genome rearrangements through uptake of foreign DNA. Extracellular enzyme systems for the metabolism of polysaccharides and hexosamines provide a substantial source of carbon and nitrogen for S. pneumoniae and also damage host tissues and facilitate colonization. A motif identified within the signal peptide of proteins is potentially involved in targeting these proteins to the cell surface of low-guanine/cytosine (GC) Gram-positive species. Several surface-exposed proteins that may serve as potential vaccine candidates were identified. Comparative genome hybridization with DNA arrays revealed strain differences in S. pneumoniae that could contribute to differences in virulence and antigenicity.

Keywords

Streptococcus pneumoniae; genome sequence; coding regions; extracellular enzymes; signal peptide motif; surface proteins; vaccine candidates; strain differences; virulence; antigenicity.

References

10.1084/jem.79.2.137

10.1098/rstb.1999.0430

10.1093/clinids/14.4.801

10.1056/NEJM199508243330810

10.3201/eid0506.990603

The TIGR4 isolate was previously referred to as JNR.7/87, the label of the clinical isolate [10.1111/j.1574-6968.1999.tb13460.x]; as KNR.7/87 [A. de Saizieu et al., J. Bacteriol. 182, 4696 (2000); R. Hakenbeck et al., Infect. Immun. 69, 2477 (2001)]; and as N4 [T. M. Wizemann et al., Infect. Immun. 69, 1593 (2001)]. Midway through the sequencing project, it became evident that one particular bacterial stock was contaminated with S. gordonii, because reads from libraries made with DNA derived from this stock were composed entirely of non-S. pneumoniae sequences (assessed by using all available S. pneumoniae and S. gordonii sequences in GenBank) and would not assemble with the S. pneumoniae DNA. Because all aspects of the sequencing project are tracked through a relational database [R. D. Fleischmann et al., Science 269, 496 (1995)], the problem was addressed by identifying and removing all the reads from the libraries in question from the project (S. gordonii sequences are available on TIGR's Web site, www.tigr.org/tdb/s_gordonii.shtml). The S. pneumoniae single-colony isolate that was grown for use in all subsequent libraries was named TIGR4.

Cloning, sequencing, and assembly were as described [W. C. Nierman et al., Proc. Natl. Acad. Sci. U.S.A. 98, 4136 (2001)]. Four small insert (∼1.5 kb) shotgun libraries were constructed in pUC-derived vectors after random mechanical shearing (nebulization) of genomic DNA, and three large insert (∼18 kb) shotgun libraries were constructed in λ-DASH II vectors (Stratagene) after partial Sau3A digestion of genomic DNA. Sequencing of the small insert libraries was achieved at a success rate of 66% with an average read length of 518 bp. The first library constructed was nonrandom, but improvement of the construction methods provided subsequent random libraries. In contrast, none of the large insert libraries appeared to be completely random. Sequencing of these yielded the following success rates per library: first, 366 nucleotides (nt) average length with a success rate of 26%; second, 620 nt at 52%; and third, 597 nt at 66%. In the late stages of closure, the newly engineered TIGR vector pHOS2 (a pBR derivative) was used to construct a new large insert (∼9 kb) library. Sequencing rates were 508 nt at 48.5% success; these are low values, but the library was substantially more random than the lambda libraries. A total of 40,839 small insert and 3449 large insert end sequences were jointly assembled into 390 contigs larger than 1.5 kb (with 220 sequencing gaps and 170 physical gaps) using TIGR Assembler [10.1089/gst.1995.1.9]. The coverage criteria were that every position required at least double-clone coverage (or sequence from a PCR product amplified from genomic DNA) and either sequence from both strands or with two different sequencing chemistries. The sequence was edited manually with the TIGR Editor, and additional PCR [10.1006/geno.1999.6048] and sequencing reactions were performed to close gaps, improve coverage, and resolve sequence ambiguities. Particularly difficult regions, including SP1772, which contains 540 copies of a 24-bp imperfect repeat, were covered by transposon-assisted sequencing (New England Biolabs pGPS Transposon Kit) and mapping of transposon insertions before assembly.
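As a rough sanity check on the shotgun depth described in this note, the small insert read count and average read length can be converted into an approximate fold coverage. This is a back-of-the-envelope sketch only: it assumes every small insert read has the quoted average length, ignores the large insert reads, and the function name fold_coverage is hypothetical rather than anything used by the authors.

```python
# Figures taken from the abstract and the sequencing note above.
GENOME_LENGTH_BP = 2_160_837   # complete genome size
SMALL_INSERT_READS = 40_839    # small insert end sequences in the assembly
MEAN_READ_LENGTH_BP = 518      # average small insert read length


def fold_coverage(n_reads: int, mean_read_len: float, genome_len: int) -> float:
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * mean_read_len / genome_len


# Assumption: all small insert reads have the average length.
small_insert_coverage = fold_coverage(
    SMALL_INSERT_READS, MEAN_READ_LENGTH_BP, GENOME_LENGTH_BP
)
print(f"Approximate small insert coverage: {small_insert_coverage:.1f}x")
```

Under these assumptions the small insert libraries alone give a little under tenfold average depth, which says nothing about how evenly that depth is distributed along the genome.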

10.1016/S0882-4010(95)90125-6

Open reading frames (ORFs) likely to encode proteins were predicted by Glimmer [10.1093/nar/26.2.544; 10.1093/nar/27.23.4636]. This program, based on interpolated Markov models, was trained with ORFs larger than 600 bp from the genomic sequence, as well as with the S. pneumoniae genes available in GenBank. All predicted proteins larger than 30 amino acids were searched against a nonredundant protein database as previously described [10.1126/science.7542800]. Frameshifts and point mutations were detected and corrected where appropriate. Remaining frameshifts and point mutations are considered to be authentic and were annotated as "authentic frameshift" or "authentic point mutation." Protein membrane-spanning domains were identified by TopPred [M. G. Claros, G. von Heijne, Comput. Appl. Biosci. 10, 685 (1994)]. The 5' regions of each ORF were inspected to define initiation codons using homologies, position of ribosomal binding sites, and transcriptional terminators. Two sets of hidden Markov models were used to determine ORF membership in families and superfamilies: Pfam v5.5 [A. Bateman et al., Nucleic Acids Res. 28, 263 (2000)] and TIGRFAMs 1.0 [D. H. Haft et al., Nucleic Acids Res. 29, 41 (2001)]. Pfam v5.5 hidden Markov models were also used, with a constraint of a minimum of two hits, to find repeated domains within proteins and mask them. Domain-based paralogous families were then built by performing all-versus-all searches on the remaining protein sequences using a modified version of a previously described method [W. C. Nierman et al., Proc. Natl. Acad. Sci. U.S.A. 98, 4136 (2001)]. The extent of potential lineage-specific gene duplications in this genome was estimated by identification of ORFs that are more similar to other ORFs within the TIGR4 genome than to ORFs from other complete genomes, including those of plasmids, organelles, and phages. All ORFs were searched with FASTA3 against all ORFs from the complete genomes, and matches with a FASTA p value of 10⁻⁵ were considered significant.
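The step of collecting long ORFs as a training set can be illustrated with a naive six-frame scan. This is not Glimmer and does not use interpolated Markov models; it is a minimal sketch that assumes ATG as the only start codon and the standard stop codons, and the function names are hypothetical. Coordinates for the minus strand are reported relative to the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def long_orfs(seq: str, min_len: int = 600):
    """List (strand, start, end, orf) for ATG..stop ORFs longer than min_len nt.

    Assumption: only the first ATG after the previous stop is used as a start.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start > min_len:    # "larger than" the cutoff
                        found.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # close it at the stop codon
    return found


# Toy sequence (made up for illustration), scanned with a lowered cutoff.
toy = "ATG" + "GCT" * 10 + "TAA"
print(long_orfs(toy, min_len=30))
```

A real gene finder would then score such candidates with a trained model rather than accept them on length alone.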

Supplementary Web material is available on Science Online at www.sciencemag.org/cgi/content/full/293/5529/498/DC1.

C. Fraser et al., Science 270, 397 (1995); F. Kunst et al., Nature 390, 249 (1997); L. Banerjei, personal communication; S. Gill, personal communication.

10.1128/jb.173.22.7361-7367.1991

10.1111/j.1432-1033.1995.tb20293.x

10.1128/jb.177.8.1919-1928.1995

10.1128/mr.57.4.862-952.1993

10.1111/j.1365-2958.1993.tb01140.x

10.1099/00221287-145-10-2647

B. Martin et al., Nucleic Acids Res. 20, 3479 (1992).

10.1073/pnas.95.23.13923

10.1006/jmbi.2000.3961

C. M. Fraser et al., Science 281, 375 (1998).

10.1016/S0065-2911(00)42004-7

10.1046/j.1365-2958.2000.02122.x

10.1046/j.1365-2958.1997.5111879.x

10.1128/iai.64.12.5255-5262.1996

10.1128/AAC.44.9.2585-2587.2000

10.1016/S1369-5274(00)00167-3

W. B. Wood, M. R. Smith, J. Exp. Med. 90 (1949).

10.1006/jmbi.1994.1472

10.1084/jem.181.3.973

10.1111/j.1365-2958.1994.tb01076.x

10.1016/S0923-2508(00)00173-X

J. N. Weiser, in Streptococcus pneumoniae: Molecular Biology and Mechanisms of Disease, A. Tomasz, Ed. (Mary Ann Liebert, Larchmont, NY, 2000), pp. 245–252.

N. J. Saunders et al., Mol. Microbiol. 37, 207 (2000).

Iterative DNA motifs, including homopolymeric tracts, were searched in the TIGR4 genome sequence using the REPEATS program [10.1093/nar/22.22.4828]. The minimum length of homopolymeric tracts was set at eight for A and T and at six for G and C; four tandem copies of di- and trinucleotides; and three copies of tetra-, penta-, and hexanucleotides. Heptanucleotides and above were not found in three or more copies, except for the imperfect repeats in SP1772. The ratio of the observed frequency of homopolymeric tracts to their expected frequency was determined by means of Markov chain analysis as described [N. J. Saunders et al., Mol. Microbiol. 37, 207 (2000)]. It revealed that G or C tracts of 8 bp and A or T tracts of 10 and 11 bp are slightly overrepresented.
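The length and copy-number thresholds in this note can be expressed as regular expressions. The sketch below is not the REPEATS program; it is a simplified stand-in that finds perfect repeats only, and the function names and output tuples are hypothetical. Units that are themselves repeats of a shorter unit (for example "AA" or "ATAT") are skipped so that each tract is reported at its shortest period.

```python
import re

# Thresholds quoted in the note above.
HOMOPOLYMER_MIN = {"A": 8, "T": 8, "G": 6, "C": 6}
TANDEM_MIN_COPIES = {2: 4, 3: 4, 4: 3, 5: 3, 6: 3}


def homopolymeric_tracts(seq: str):
    """Return sorted (start, base, length) for tracts meeting the thresholds."""
    seq = seq.upper()
    hits = []
    for base, min_len in HOMOPOLYMER_MIN.items():
        for m in re.finditer(f"{base}{{{min_len},}}", seq):
            hits.append((m.start(), base, m.end() - m.start()))
    return sorted(hits)


def _is_primitive(unit: str) -> bool:
    """True if the unit is not a repetition of a shorter unit."""
    k = len(unit)
    return not any(
        k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k)
    )


def tandem_repeats(seq: str):
    """Return sorted (start, unit, copies) for perfect di- to hexanucleotide repeats."""
    seq = seq.upper()
    hits = []
    for k, min_copies in TANDEM_MIN_COPIES.items():
        pattern = re.compile(rf"([ACGT]{{{k}}})\1{{{min_copies - 1},}}")
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if _is_primitive(unit):
                hits.append((m.start(), unit, (m.end() - m.start()) // k))
    return sorted(hits)


# Toy sequence (made up for illustration): one A tract, one (CA)4 repeat.
toy = "CT" + "A" * 8 + "CT" + "G" * 5 + "TT" + "CA" * 4 + "T"
print(homopolymeric_tracts(toy))
print(tandem_repeats(toy))
```

Because the regular expressions require exact copies, imperfect repeats such as those in SP1772 would need a different, mismatch-tolerant approach.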

J. O. Kim et al., Infect. Immun. 67, 2327 (1999).

10.1073/pnas.92.20.9052

10.1016/S0921-8777(99)00050-6

Putative choline-binding motifs [J. L. Garcia, A. R. Sanchez-Beato, F. J. Medrano, R. Lopez, in Streptococcus pneumoniae: Molecular Biology and Mechanisms of Disease, A. Tomasz, Ed. (Mary Ann Liebert, Larchmont, NY, 2000), pp. 231–244] were identified using Pfam hidden Markov model (HMM) PF01473 [A. Bateman et al., Nucleic Acids Res. 28, 263 (2000)]. LPxTG-type Gram-positive anchor regions [10.1016/S0966-842X(01)01956-4] were detected by Pfam HMM PF00746 and by a new HMM built with HMMER 2.1.1 [10.1093/bioinformatics/14.9.755] from a new curated alignment of the surrounding region in S. pneumoniae. Candidate lipoprotein signal peptides [10.1007/BF00763177] were flagged by NH2-terminal exact matches to the pattern {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C (35), culled of hypothetical proteins and cytosolic proteins, aligned manually, and used to generate a new HMM. Proteins matching both the HMM and the regular expression are predicted lipoproteins. Putative signal peptides were identified with SignalP [10.1093/protein/10.1.1].

The NH2-terminal regions of all proteins predicted to have signal sequences were collected for clustering and alignment with ClustalW and were scrutinized. An HMM based on an edited alignment of 40-residue segments around the (Y/F)SIRK motif found several hundred hits to a nonredundant amino acid database. A more general motif based on the larger family of YSIRK proteins is (Y/F)(S/A)(I/L)(R/K)(R/K)xxxGxxS (35).
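The lipoprotein pattern and the generalized YSIRK motif quoted in these notes translate directly into regular expressions, where {DERK} means "any residue except D, E, R, or K" and x means any residue. The sketch below covers only the pattern-matching part, not the HMM searches. The 40-residue NH2-terminal window applied to both patterns is an assumption made for illustration, the function name is hypothetical, and the example peptides are synthetic rather than TIGR4 proteins.

```python
import re

# {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C
LIPOBOX = re.compile(r"[^DERK]{6}[LIVMFWSTAG]{2}[LIVMFYSTAGCQ][AGS]C")

# (Y/F)(S/A)(I/L)(R/K)(R/K)xxxGxxS
YSIRK = re.compile(r"[YF][SA][IL][RK][RK].{3}G.{2}S")


def scan_n_terminus(protein: str, window: int = 40) -> dict:
    """Report which motifs occur in the first `window` residues.

    Assumption: a fixed NH2-terminal window stands in for "NH2-terminal match".
    """
    n_term = protein.upper()[:window]
    return {
        "lipobox": LIPOBOX.search(n_term) is not None,
        "ysirk": YSIRK.search(n_term) is not None,
    }


# Synthetic peptides, invented to exercise each pattern.
lipo_like = "MKKTLLALLAASLLLAGCSS"
ysirk_like = "MNKKYSIRKFSVGVASVLIG"
print(scan_n_terminus(lipo_like))
print(scan_n_terminus(ysirk_like))
```

In the notes, a regular-expression match alone is not the final call for lipoproteins: a protein is predicted to be a lipoprotein only when it also matches the HMM.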

Single-letter abbreviations for the amino acid residues are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.

10.1128/jb.174.22.7419-7427.1992

J. Davies et al., Infect. Immun. 63, 2485 (1995).

This method is used to identify genomic differences between the TIGR4 strain and strains R6 and D39. All the predicted genes from the TIGR4 strain were amplified by PCR and arrayed on glass microscope slides as previously described [10.1128/JB.182.21.6192-6202.2000]. Genomic DNA for comparative genome hybridization studies was labeled according to protocols provided by J. DeRisi (www.microarrays.org/pdfs/GenomicDNALabel_B.pdf), except that genomic DNA was not digested or sheared before labeling. Arrays were scanned with a GenePix 4000B scanner from Axon (Union City, CA), and individual hybridization signals were quantitated with TIGR SPOTFINDER [P. Hegde et al., Biotechniques 29, 548 (2000)].

10.1128/jb.137.2.735-739.1979

Regions of atypical nucleotide composition were identified by χ² analysis: the distribution of all 64 trinucleotides (trimers) was computed for the complete genome in all six reading frames, followed by the trimer distribution in 2000-bp windows. Windows overlapped by 1500 bp. For each window, the χ² statistic on the difference between its trimer content and that of the whole genome was computed. The most atypical regions, with a score of 600 and above, were considered in this analysis.
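The windowed χ² scan can be sketched as follows. The note does not spell out the counting conventions, so this version makes explicit assumptions: trimers are counted at every offset on both strands (which covers all six reading frames), expected window counts are the window total scaled by the whole-genome trimer frequencies, and the statistic is the usual sum of (observed - expected)² / expected. Scores from this sketch are therefore not guaranteed to be on the same scale as the cutoff of 600 quoted above, and the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trimer_counts(seq: str) -> Counter:
    """Count trimers at every offset on both strands (all six frames)."""
    counts = Counter()
    for strand in (seq, seq.translate(_COMPLEMENT)[::-1]):
        for i in range(len(strand) - 2):
            tri = strand[i:i + 3]
            if set(tri) <= set("ACGT"):   # skip ambiguous bases
                counts[tri] += 1
    return counts


def chi_square_windows(genome: str, window: int = 2000, step: int = 500):
    """Return (start, chi2) per window; step 500 gives a 1500-bp overlap."""
    genome = genome.upper()
    g_counts = trimer_counts(genome)
    g_total = sum(g_counts.values())
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        w_counts = trimer_counts(genome[start:start + window])
        w_total = sum(w_counts.values())
        chi2 = 0.0
        for tri, g_n in g_counts.items():
            expected = w_total * g_n / g_total
            chi2 += (w_counts.get(tri, 0) - expected) ** 2 / expected
        scores.append((start, chi2))
    return scores


# Toy genome (made up): uniform background with a 2000-bp poly-G island.
toy = "ACGTTGCA" * 1000 + "G" * 2000 + "ACGTTGCA" * 1000
scores = chi_square_windows(toy)
most_atypical_start = max(scores, key=lambda item: item[1])[0]
print(most_atypical_start)
```

On the toy genome the highest-scoring window is the one that lies entirely within the inserted island, which is the behavior the scan relies on to flag compositionally atypical regions.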

R. Hakenbeck et al., Infect. Immun. 69, 2477 (2001).

T. M. Wizemann et al., Infect. Immun. 69, 1593 (2001).

We thank M. Heaney, J. Scott, M. Holmes, V. Sapiro, B. Lee, and B. Vincent for software and database support at TIGR; M. Ermolaeva and M. Pertea for specific computer analyses; the TIGR faculty and sequencing core for expert advice and assistance; I. Aaberge (National Institute of Public Health, Oslo, Norway) for providing the initial clinical isolate labeled JNR.7/87; and G. Zysk and A. Polissi for sharing specific sequence data not deposited in GenBank. Supported in part by the National Institute of Allergy and Infectious Diseases (grant R01 AI40645-01A1) and the Merck Genome Research Institute (grant MGRI72).