New Approaches to Principal Component Analysis for Trees

Statistics in Biosciences - Volume 4 - Pages 132-156 - 2012
Burcu Aydın (1), Gábor Pataki (2), Haonan Wang (3), Alim Ladha (2), Elizabeth Bullitt (2), J. S. Marron (2)
(1) HP Laboratories, Palo Alto, USA
(2) UNC at Chapel Hill, Chapel Hill, USA
(3) Colorado State University, Fort Collins, USA

Abstract

Object Oriented Data Analysis is a new area in statistics that studies populations of general data objects. In this article we consider populations of tree-structured objects as our focus of interest. We develop improved analysis tools for data lying in a binary tree space analogous to classical Principal Component Analysis methods in Euclidean space. Our extensions of PCA are analogs of one dimensional subspaces that best fit the data. Previous work was based on the notion of tree-lines. In this paper, a generalization of the previous tree-line notion is proposed: k-tree-lines. Previously proposed tree-lines are k-tree-lines where k=1. New sub-cases of k-tree-lines studied in this work are the 2-tree-lines and tree-curves, which explain much more variation per principal component than tree-lines. The optimal principal component tree-lines were computable in linear time. Because 2-tree-lines and tree-curves are more complex, they are computationally more expensive, but yield improved data analysis results. We provide a comparative study of all these methods on a motivating data set consisting of brain vessel structures of 98 subjects.
