Characterization and evaluation of similarity measures for pairs of clusterings

Darius Pfitzner1, Richard Leibbrandt1, David Powers1
1Department of Computer Science, Engineering and Mathematics, Flinders University of South Australia, Bedford Park, Australia

References

Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data-mining applications

Arabie P, Boorman SA (1973) Multidimensional scaling of measures of distance between partitions. J Math Psychol 10: 148–203

Baroni-Urbani C, Buser MW (1976) Similarity of binary data. Syst Zool 25(3): 251–259

Berkhin P (2002) Survey of clustering data mining techniques. Technical report, Accrue Software

Braun-Blanquet J (1932) Plant sociology: the study of plant communities. McGraw-Hill Book Company, Inc, New York

Cheeseman P, Stutz J (1996) Bayesian classification (autoclass): theory and results. In: Fayyad UN, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. AAAI/MIT press, Cambridge, pp 153–180

Coombs CH, Dawes RM, Tversky A (1970) Mathematical psychology: an elementary introduction. Prentice-Hall, Englewood Cliffs, NJ

Dennis RLH, Williams WR, Shreeve TG (1998) Faunal structures among European butterflies: evolutionary implications of bias for geography, endemism and taxonomic affiliation. Ecography 21: 181–203

Dice LE (1945) Measures of the amount of ecologic association between species. Ecology 26(3): 297–302

Fager EW, McGowan JA (1963) Zooplankton species groups in the North Pacific: co-occurrences of species can be used to derive groups whose members react similarly to water-mass types. Science 140: 453–460. doi:10.1126/science.140.3566.453

Faith DP (1983) Asymmetric binary similarity measures. Oecologia 57(3): 287–290

Filkov V, Skiena S (2004) Heterogeneous data integration with the consensus clustering formalism. In: Data Integration in the Life Sciences (DILS), 1st International Workshop. Lecture Notes in Computer Science 2994: 110–123

Forbes S (1925) Method of determining and measuring the associative relations of species. Science 61(1585): 518–524

Fossum TV, Haller SM (2004) Measuring card sort orthogonality. Expert Syst 22(3): 139–146

Fowlkes EB, Mallows CL (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78(383): 553–569

Fred A, Jain A (2003) Robust data clustering. In: IEEE computer society conference on computer vision and pattern recognition

Gilbert N, Wells TCE (1966) Analysis of quadrat data. J Ecol 54(3): 675–685

Goodall DW (1967) The distribution of the matching coefficient. Biometrics 23(4): 647–656

Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17: 107–145

Hamann U (1961) Merkmalsbestand und Verwandtschaftsbeziehungen der Farinosae: Ein Beitrag zum System der Monokotyledonen. Willdenowia 2: 639–768

Hayek LC (1994) Analysis of amphibian biodiversity data. In: Heyer WR, Donnelly MA, McDiarmid RW, Hayek L-AC, Foster MS (eds) Measuring and monitoring biological diversity: standard methods for amphibians. Smithsonian Institution Press

Hinneburg A, Keim DA (2003) A general approach to clustering in large databases with noise. Knowl Inf Syst 5(4): 387–415

Holliday JD, Hu C-Y, Willett P (2002) Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2d fragment bit-strings. Comb Chem High Throughput Screen 5(2): 155–166

Horibe Y (1985) Entropy and correlation. IEEE Trans Syst Man Cybern (SMC) SMC-15(5): 641–642

Jaccard P (1901) Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, pp 241–272

Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32(3): 241–254

Karypis G, Han E-H, Kumar V (1999) Chameleon: a hierarchical clustering algorithm using dynamic modeling. IEEE Comput 32(8): 68–75

Knobbe AJ, Adriaans PW (1996) Analysis of binary association. In: Knowledge Discovery and Data Mining (KDD-96). Portland, Oregon, pp 311–314

Kulczynski S (1927) Zespoly roslin w Pieninach (Die Pflanzenassoziationen der Pieninen). Bulletin international de l’académie polonaise des sciences et des lettres B(2): 57–203

Kvalseth TO (1987) Entropy and correlation: some comments. IEEE Trans Syst Man Cybern SMC-17: 517–519

Lee TT (1987) An information theoretic analysis of relational databases - part 1: data dependencies and information metric. IEEE Trans Softw Eng SE-13(10): 1049–1061

Linfoot EH (1957) An informational measure of correlation. Inf Control 1: 85–87

Lopez de Mantaras R (1989) ID3 revisited: a distance-based criterion for attribute selection. In: International symposium on methodologies for intelligent systems (ISMIS-89). Charlotte, North Carolina

MacQueen J (1967) Some methods for classification and analysis of multivariate observations

Malvestuto FM (1986) Statistical treatment of the information content of a database. Inf Syst 11(3): 211–223

Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge

McConnaughey BH (1964) The determination and analysis of plankton communities. Marine Research Indonesia Special (Penelitian Laut Di Indonesia) Spec. no. 30

Meila M (2003) Comparing clusterings by variation of information. Proceedings of the 16th annual conference of computational learning theory (COLT)

Michael EL (1920) Marine ecology and the coefficient of association: A plea in behalf of quantitative biology. J Ecol 8(1): 54–59

Mirkin B (1996) Mathematical classification and clustering. Kluwer Academic Press, Boston–Dordrecht

Mirkin B (2001) Eleven ways to look at the chi-squared coefficient for contingency tables. Am Stat 55(6): 111–120

Mountford MD (1962) An index of similarity and its application to classificatory problems. In: Murphy PW (ed) Progress in soil zoology. Butterworth, London, pp 43–50

Pawlak Z, Wong SKM, Ziarko W (1988) Rough sets: probabilistic versus deterministic approach. Int J Man Mach Stud 29(1): 81–95

Powers DMW (2007) Expected information in the transmission of an equality selection of distribution/clustering or of individual class labels. Technical report, Flinders University (S.A.)

Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988) Numerical recipes in C: the art of scientific computing. Cambridge University Press, Cambridge

Quinlan JR (1990) Induction of decision trees. In: Shavlik JW, Dietterich TG (eds) Readings in machine learning, Morgan Kaufmann. Originally published in machine learning 1:81–106, 1986.

Rajski C (1961) A metric space of discrete probability distributions. Inf Control 4(4): 371–377

Rand WM (1971) Objective criteria for evaluation of clustering methods. J Am Stat Assoc 66(336): 846–850

Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132(3434): 1115–1118

Russell PF, Rao TR (1940) On habitat and association of species of anopheline larvae in south-eastern Madras. J Malaria Inst India 3: 153–178

Savage RM (1934) The breeding behavior of the common frog, Rana temporaria Linn., and of the common toad, Bufo bufo bufo Linn. Zoological Society of London, pp 55–70

Sneath PHA (1968) Vigour and pattern in taxonomy. J Gen Microbiol 54(1): 1–11

Sneath PHA, Sokal RR (1973) Numerical taxonomy. Freeman and Company, San Francisco

Sokal RR, Sneath PHA (1964) Principles of numerical taxonomy. Syst Zool 13: 106–108

Sorgenfrei T (1958) Molluscan assemblages from the marine middle miocene of south jutland and their environments

Strehl A, Ghosh J (2002) Cluster ensembles: a knowledge reuse framework for combining partitionings. J Mach Learn Res 3: 583–617

Tarwid K (1960) Szacowanie zbieznosci nisz ekologicznych gatunkow droga oceny prawdopodobienstwa spotykania sie ich w polowach. Ecol Polska B(6): 115–130

Theodoridis S, Koutroumbas K (1999) Pattern recognition. Academic Press, New York

Thurstone L (1927) A law of comparative judgement. Psychol Rev 34: 278–286

Wallace DL (1983) A method for comparing two hierarchical clusterings: comment. J Am Stat Assoc 78(383): 569–576

Wan SJ, Wong SKM (1989) A measure for concept dissimilarity and its applications in machine learning. In: International conference on computing and information. Toronto North, Canada, pp 23–27

Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Amsterdam

Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1): 1–37

Yao YY, Wong SKM, Butz CJ (1999) On information theoretic measures of attribute importance. In: Zhong N (ed) PAKDD’99. Beijing, China, pp 133–137

Yule GU (1912) On the methods of measuring association between two attributes. J R Stat Soc 75(6): 579–642

Zhong S, Ghosh J (2005) Generative model-based document clustering: a comparative study. Knowl Inf Syst 8(3): 374–384