Consolidating metabolite identifiers to enable contextual and multi-platform metabolomics data analysis
Abstract
Analysis of data from high-throughput experiments depends on the availability of well-structured data that describe the assayed biomolecules. Procedures for obtaining and organizing such metadata on genes, transcripts and proteins have been streamlined in many data analysis packages, but are still lacking for metabolites. Chemical identifiers are notoriously incoherent, encompassing a wide range of referencing schemes with varying scope and coverage. Online chemical databases use multiple types of identifiers in parallel but lack a common primary key for reliable database consolidation. Connecting the identifiers of analytes found in experimental data with the identifiers of their parent metabolites in public databases can therefore be very laborious. Here we present a strategy and a software tool for integrating metabolite identifiers from local reference libraries and public databases, neither of which depends on a single common primary identifier. The program constructs groups of interconnected identifiers of analytes and metabolites to obtain a local, metabolite-centric SQLite database. The resulting database can be used to map in-house identifiers and synonyms to external resources such as the KEGG database. New identifiers can be imported and directly integrated with existing data. Queries can be performed flexibly, both from the command line and from the statistical programming environment R, to obtain identifier mappings tailored to a given data set. Efficient cross-referencing of metabolite identifiers is a key technology for metabolomics data analysis. We provide a practical and flexible solution to this task and an open-source program, the metabolite masking tool (MetMask), available at http://metmask.sourceforge.net, that implements our ideas.
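The idea of grouping interconnected identifiers without relying on one primary key can be illustrated with a small sketch. This is not MetMask's code or API: the function names `group_identifiers` and `lookup`, the record layout, and the `inhouse` identifier values are all hypothetical, and a union-find structure is only one plausible way to form such groups. Each record is assumed to be a dictionary from identifier type to value, and any two records sharing an identifier end up in the same group.

```python
from collections import defaultdict


def group_identifiers(records):
    """Merge records into groups whenever they share any (type, value) pair.

    Illustrative sketch only; not taken from MetMask.
    """
    parent = {}

    def find(x):
        # Return the representative of x, registering x on first sight.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for record in records:
        idents = list(record.items())
        if not idents:
            continue
        find(idents[0])  # register single-identifier records too
        for other in idents[1:]:
            union(idents[0], other)

    groups = defaultdict(set)
    for ident in list(parent):
        groups[find(ident)].add(ident)
    return list(groups.values())


def lookup(groups, query, target_type):
    """Return all values of target_type in the group containing query."""
    for group in groups:
        if query in group:
            return {value for kind, value in group if kind == target_type}
    return set()


# Hypothetical input: one in-house library entry and two database entries.
records = [
    {"inhouse": "A001", "synonym": "alanine"},
    {"synonym": "alanine", "kegg": "C00041"},
    {"synonym": "glucose", "kegg": "C00031"},
]

groups = group_identifiers(records)
print(lookup(groups, ("inhouse", "A001"), "kegg"))
```

Here the in-house entry and the KEGG entry never share a primary key; they are linked only through the common synonym, which is enough to place them in one group and to resolve the in-house identifier to a KEGG compound. In a persistent setting the groups could be written to a database table keyed by group, so that newly imported identifiers join existing groups.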