Consolidating metabolite identifiers to enable contextual and multi-platform metabolomics data analysis

BMC Bioinformatics - Volume 11 - Pages 1-11 - 2010
Henning Redestig1, Miyako Kusano1, Atsushi Fukushima1, Fumio Matsuda1, Kazuki Saito1, Masanori Arita1
1Metabolomics Research Group, RIKEN Plant Science Center, Yokohama, Japan

Abstract

Analysis of data from high-throughput experiments depends on the availability of well-structured data that describe the assayed biomolecules. Procedures for obtaining and organizing such meta-data on genes, transcripts and proteins have been streamlined in many data analysis packages, but are still lacking for metabolites. Chemical identifiers are notoriously incoherent, encompassing a wide range of referencing schemes with varying scope and coverage. Online chemical databases use multiple types of identifiers in parallel but lack a common primary key for reliable database consolidation. Connecting the identifiers of analytes found in experimental data with the identifiers of their parent metabolites in public databases can therefore be very laborious. Here we present a strategy and a software tool for integrating metabolite identifiers from local reference libraries and public databases that do not depend on a single common primary identifier. The program constructs groups of interconnected identifiers of analytes and metabolites to obtain a local, metabolite-centric SQLite database. The created database can be used to map in-house identifiers and synonyms to external resources such as the KEGG database. New identifiers can be imported and directly integrated with existing data. Queries can be performed flexibly, both from the command line and from the statistical programming environment R, to obtain identifier mappings tailored to a given data set. Efficient cross-referencing of metabolite identifiers is a key technology for metabolomics data analysis. We provide a practical and flexible solution to this task and an open-source program, the metabolite masking tool (MetMask), available at http://metmask.sourceforge.net, that implements our ideas.
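The grouping idea in the abstract can be illustrated with a small sketch. It is not MetMask's actual code or database schema: the function names, the single `identifiers` table, and the in-house identifier `LIB-0042` are assumptions made purely for illustration. The sketch treats each imported record as a set of typed identifiers, merges records that share any identifier (connected components via union-find, so no single primary key is needed), stores the groups in SQLite, and then maps one identifier type to another.

```python
import sqlite3
from collections import defaultdict


def group_identifiers(records):
    """Merge records that share at least one (type, value) identifier.

    Each record is a dict mapping an identifier type to a set of values.
    Returns a list of merged groups in the same dict-of-sets form.
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (type, value) -> index of the first record holding it
    for idx, record in enumerate(records):
        for id_type, values in record.items():
            for value in values:
                key = (id_type, value)
                if key in first_seen:
                    # Shared identifier: join the two records' groups.
                    parent[find(idx)] = find(first_seen[key])
                else:
                    first_seen[key] = idx

    merged = defaultdict(lambda: defaultdict(set))
    for idx, record in enumerate(records):
        root = find(idx)
        for id_type, values in record.items():
            merged[root][id_type] |= set(values)
    return [dict(group) for group in merged.values()]


def store_groups(conn, groups):
    """Write the groups to a hypothetical one-table SQLite layout."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS identifiers "
        "(group_id INTEGER, id_type TEXT, value TEXT)"
    )
    for group_id, group in enumerate(groups):
        for id_type, values in group.items():
            conn.executemany(
                "INSERT INTO identifiers VALUES (?, ?, ?)",
                [(group_id, id_type, v) for v in sorted(values)],
            )
    conn.commit()


def map_identifier(conn, from_type, value, to_type):
    """Return all identifiers of to_type in the same group as the query."""
    rows = conn.execute(
        "SELECT DISTINCT b.value FROM identifiers a "
        "JOIN identifiers b ON a.group_id = b.group_id "
        "WHERE a.id_type = ? AND a.value = ? AND b.id_type = ? "
        "ORDER BY b.value",
        (from_type, value, to_type),
    ).fetchall()
    return [row[0] for row in rows]


# Illustrative records: an in-house library entry (hypothetical ID) and
# two public-database style entries.
records = [
    {"inhouse": {"LIB-0042"}, "cas": {"56-40-6"}, "synonym": {"glycine"}},
    {"kegg": {"C00037"}, "cas": {"56-40-6"}},
    {"kegg": {"C00041"}, "synonym": {"L-alanine"}},
]

conn = sqlite3.connect(":memory:")
store_groups(conn, group_identifiers(records))
print(map_identifier(conn, "inhouse", "LIB-0042", "kegg"))
```

In this toy example the in-house entry and the first public entry share a CAS number, so they end up in one group and the in-house identifier can be mapped to a KEGG compound identifier even though neither record contains both. A real consolidation would also need rules for ambiguous or conflicting links, which this sketch does not attempt.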
