Where are the large and difficult datasets?

Advances in Data Analysis and Classification - Volume 3 - Pages 25-38 - 2009
Adrien Jamain¹, David J. Hand²
¹ BNP-Paribas, London, UK
² Department of Mathematics, Institute for Mathematical Sciences, London, UK

Abstract

A great many comparative performance assessments of classification rules have been undertaken, ranging from small ones involving just one or two methods, to large ones involving many tens of methods. We are undertaking a meta-analytic study of these studies, attempting to distil some overall conclusions. This paper describes just one of our observations. The dataset analysed in this paper contains 5,203 error rates taken from 45 articles and describing 146 datasets. One curious general relationship which was persistent in our data, despite the fact that we were looking at results mixed between distributions rather than conditional on distributions, was that error rate decreased with increasing dataset size. We believe this to be an artefact of the way datasets are collected by the research community.
