Visualizing high density clusters in multidimensional data using optimized star coordinates

Computational Statistics - Volume 26 - Pages 655-678 - 2011
Tran Van Long1,2, Lars Linsen1
1School of Engineering and Science, Jacobs University, Bremen, Germany
2University of Transport and Communications, Hanoi, Vietnam

Abstract

Multidimensional multivariate data have been studied in different areas for quite some time. Commonly, the goal of the analysis is not to examine individual records but to understand the overall distribution of the records and to find clusters of records that exhibit correlations between dimensions or variables. We propose a visualization method that operates on density rather than on individual records. So as not to restrict the search for clusters, we compute density in the given multidimensional space. Clusters are formed by areas of high density. We present an approach that automatically computes a hierarchical tree of high density clusters. For visualization purposes, we propose a method to project the multidimensional clusters to a 2D or 3D layout. The projection method uses an optimized star coordinates layout. The optimization procedure minimizes the overlap of projected clusters and preserves the cluster shapes, compactness, and distribution as far as possible. The star coordinates visualization allows for an interactive analysis of the distribution of clusters and for comprehension of the relations between clusters and the original dimensions. Clusters are visualized using nested sequences of density level sets, leading to a quantitative understanding of information content, patterns, and relationships.
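To make the projection step concrete, the following is a minimal sketch of a plain star coordinates mapping in Python. It is not the optimized layout proposed in the paper: it assumes the standard arrangement with axis vectors evenly spaced on the unit circle and min-max normalization of each dimension. The function names `star_axes` and `project` and the sample data are hypothetical and chosen only for illustration.

```python
import math

def star_axes(d):
    """Unit axis vectors evenly spaced on the circle (assumed, unoptimized layout)."""
    return [(math.cos(2 * math.pi * i / d), math.sin(2 * math.pi * i / d))
            for i in range(d)]

def project(point, axes, mins, maxs):
    """Map a d-dimensional point to 2D: sum of axis vectors scaled by normalized values."""
    x = y = 0.0
    for value, (ax, ay), lo, hi in zip(point, axes, mins, maxs):
        # Min-max normalization to [0, 1]; a constant dimension contributes nothing.
        t = (value - lo) / (hi - lo) if hi > lo else 0.0
        x += t * ax
        y += t * ay
    return (x, y)

# Hypothetical 4-dimensional sample records.
data = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 2.0, 0.0, 0.0),
    (1.0, 2.0, 4.0, 8.0),
]
mins = [min(col) for col in zip(*data)]
maxs = [max(col) for col in zip(*data)]
axes = star_axes(len(mins))
projected = [project(p, axes, mins, maxs) for p in data]

for p, q in zip(data, projected):
    print(p, "->", (round(q[0], 6), round(q[1], 6)))
```

In this sketch the first and last records land at essentially the same 2D position even though they sit at opposite corners of the data range, because the four evenly spaced axis vectors sum to zero. Collisions of this kind illustrate why a layout that adjusts the axis vectors to reduce overlap, as the abstract describes, can be useful.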
