Improving gene annotation using peptide mass spectrometry

Genome Research - Volume 17, Issue 2 - Pages 231-239 - 2007
Stephen Tanner1, Zhouxin Shen2, John N. Ng2, Liliana Florea3, Roderic Guigó4, Steven P. Briggs2, Vineet Bafna2
1Bioinformatics Program, University of California, San Diego, La Jolla, California 92093-0419, USA
2University of California at San Diego
3George Washington University
4Centre for Genomic Regulation

Abstract

Annotation of protein-coding genes is a key goal of genome sequencing projects. Despite tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct a genomic database and to search spectra against it efficiently, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project.
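The abstract does not spell out how the genomic database is built or searched, so the following is only a rough illustration of the general idea of matching peptides to a genome without a protein list. It is a naive six-frame translation followed by an in-silico tryptic digest, not the authors' algorithm, and it ignores splicing and polymorphisms entirely. All function names (`six_frames`, `tryptic_digest`, `candidate_peptides`) are hypothetical, and the cleavage rule (after K or R, not before P) is an assumed convention rather than something stated in the abstract.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Monoisotopic residue masses in daltons (unmodified residues only).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def reverse_complement(dna):
    """Return the reverse complement of an uppercase ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Yield (strand, offset, protein) for all six reading frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def candidate_peptides(dna, min_len=2):
    """Map each candidate peptide from any frame to its mass."""
    found = {}
    for _strand, _offset, protein in six_frames(dna):
        for segment in protein.split("*"):  # stop codons end a segment
            for pep in tryptic_digest(segment):
                if len(pep) >= min_len and all(aa in MONO for aa in pep):
                    found[pep] = peptide_mass(pep)
    return found
```

In a real search, the masses returned by `candidate_peptides` would serve as a first filter against measured precursor masses before any fragment-level scoring of the MS/MS spectrum. A six-frame index like this cannot represent peptides that span exon boundaries, which is one reason a splice-aware database of the kind the abstract describes is needed.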
