Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets

BMC Bioinformatics - Volume 13 - Pages 1-15 - 2012

Fangzhou Yao¹,², Jeff Coquery²,³, Kim-Anh Lê Cao²

¹Shanghai University of Finance and Economics, Shanghai, P.R. China

²Queensland Facility for Advanced Bioinformatics, University of Queensland, St Lucia, Australia

³Sup'Biotech, Villejuif, France

Abstract

A key question when analyzing high-throughput data is whether the information provided by the measured biological entities (gene or metabolite expression, for example) is related to the experimental conditions or, rather, to interfering signals such as experimental bias or artefacts. Visualization tools are therefore useful for understanding the underlying structure of the data in a 'blind' (unsupervised) way. A well-established technique for doing so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to the directions of highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA, as it optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can overcome both the high dimensionality and the noisy characteristics of biological data. We propose Independent Principal Component Analysis (IPCA), which combines the advantages of both PCA and ICA. It uses ICA as a denoising process applied to the loading vectors produced by PCA, to better highlight the important biological entities and reveal insightful patterns in the data. The result is a better clustering of the biological samples on graphical representations. In addition, a sparse version (sIPCA) is proposed that performs an internal variable selection to identify biologically relevant features. In simulation studies and on real data sets, we show that IPCA offers a better visualization of the data than ICA, and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the list of genes selected with sIPCA demonstrates that the approach can highlight genes that are relevant to the biological experiment. IPCA and sIPCA are both implemented in the R package mixOmics, which is dedicated to the analysis and exploration of high-dimensional biological data sets, and in the mixOmics web interface.
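The idea of running ICA on the PCA loading vectors can be illustrated with a minimal NumPy sketch. This is not the mixOmics implementation: the function names (`ipca_sketch`, `excess_kurtosis`), the symmetric FastICA update with a tanh nonlinearity, and the ordering of the resulting loading vectors by kurtosis are all assumptions made for illustration.

```python
import numpy as np


def _sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W


def excess_kurtosis(S):
    """Row-wise excess kurtosis of a matrix whose rows have zero mean."""
    return (S**4).mean(axis=1) / (S**2).mean(axis=1) ** 2 - 3.0


def ipca_sketch(X, n_components=2, n_iter=200, tol=1e-6, seed=0):
    """Illustrative IPCA: PCA loadings, then FastICA applied to the loadings.

    X has shape (n_samples, n_variables). Returns (scores, loadings) with
    shapes (n_samples, k) and (k, n_variables).
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)  # center each variable

    # Step 1: PCA via SVD; the rows of Vt are the PCA loading vectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]  # shape (k, p)
    k, p = V.shape

    # Step 2: whiten the loadings so that Z @ Z.T / p is the identity.
    Z = V - V.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(Z @ Z.T / p)
    Z = (E / np.sqrt(d)) @ E.T @ Z

    # Step 3: symmetric FastICA (tanh nonlinearity) on the whitened loadings,
    # treating the p variables as the observations.
    W = _sym_decorrelate(rng.standard_normal((k, k)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        W_new = G @ Z.T / p - (1.0 - G**2).mean(axis=1)[:, None] * W
        W_new = _sym_decorrelate(W_new)
        # Converged when each new row is parallel to the corresponding old row.
        change = np.max(np.abs(np.abs(np.einsum("ij,ij->i", W_new, W)) - 1.0))
        W = W_new
        if change < tol:
            break
    S = W @ Z  # independent loading vectors, shape (k, p)

    # Step 4 (assumption): order loading vectors by decreasing kurtosis.
    S = S[np.argsort(excess_kurtosis(S))[::-1]]

    # Step 5: project the centered data onto the independent loading vectors.
    scores = Xc @ S.T
    return scores, S
```

Calling `ipca_sketch(X, n_components=2)` on a samples-by-variables matrix returns sample scores that could be plotted in place of the first two principal components. A sparse variant in the spirit of sIPCA would additionally shrink small loading entries to zero; that step is omitted here.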

