Analysis of superfamily specific profile-profile recognition accuracy
Abstract
Annotating sequences that share little similarity with sequences of known function remains a major obstacle in genome annotation. Some of the best methods for detecting remote relationships between protein sequences are based on matching sequence profiles. We analyse the superfamily-specific performance of sequence profile-profile matching. Our benchmark consists of 16 protein superfamilies that are highly diverse at the sequence level. We relate the performance to the number of sequences in the profiles, the profile diversity and the extent of structural conservation in the superfamily. Performance varies greatly between superfamilies: the truncated receiver operating characteristic score, ROC10, ranges from 0.95 down to 0.01. These large differences persist even when the profiles are trimmed to approximately the same level of diversity. Although the number of sequences in the profile (profile width) and the degree of sequence variation within positions in the profile (profile diversity) contribute to accurate detection, there are other superfamily-specific factors as well.
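The abstract quotes ROC10 values without defining the score. As a rough illustration only, the sketch below computes a truncated ROC score under the commonly used definition: for each of the first n false positives in a ranked hit list, count the true positives ranked above it, sum those counts, and divide by n times the total number of true positives. The function name `roc_n`, its parameters, and the handling of lists with fewer than n false positives are assumptions made for this example; the paper's exact procedure may differ.

```python
def roc_n(ranked_labels, n=10, total_positives=None):
    """Truncated ROC score for a ranked hit list (hypothetical helper).

    ranked_labels: booleans ordered from best to worst score;
                   True marks a true homolog, False a false positive.
    n:             number of false positives at which the curve is cut.
    total_positives: number of true homologs in the database; defaults
                   to the number of True entries in ranked_labels.
    """
    labels = list(ranked_labels)
    if total_positives is None:
        total_positives = sum(labels)
    if n <= 0 or total_positives <= 0:
        raise ValueError("n and total_positives must be positive")

    true_seen = 0    # true positives encountered so far
    false_seen = 0   # false positives encountered so far
    area = 0         # sum of true-positive counts at each false positive
    for is_true in labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break

    # Assumption: if the list holds fewer than n false positives, the
    # missing ones are treated as ranking below every listed hit.
    area += (n - false_seen) * true_seen
    return area / (n * total_positives)


if __name__ == "__main__":
    perfect = [True] * 5 + [False] * 10
    mixed = [True, False, True, False]
    print(roc_n(perfect))     # all homologs ranked first
    print(roc_n(mixed, n=2))  # one homolog falls behind a false positive
```

Under this definition a score of 1 means every true homolog ranks above the first false positive, and a score near 0 means almost no true homologs are recovered before the n-th false positive.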
