A systematic analysis of performance measures for classification tasks

Information Processing & Management - Volume 45 - Pages 427-437 - 2009
Marina Sokolova1, Guy Lapalme2
1Electronic Health Information Lab, Children’s Hospital of Eastern Ontario, Ottawa, Canada
2Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montréal, Canada

References

Asuncion, A., & Newman, D. (2007). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. <http://www.ics.uci.edu/mlearn/MLRepository.html>.
Bengio, S., Mariéthoz, J., & Keller, M. (2005). The expected performance curve. In Proceedings of the ICML’05 workshop on ROC analysis in machine learning (pp. 43–50).
Blitzer, 2007, Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification, 440
Blockeel, H., Bruynooghe, M., Dzeroski, S., Ramon, J., & Struyf, J. (2002). Hierarchical multi-classification. In KDD-2002: Workshop on multi-relational data mining (pp. 21–35).
Bobicev, 2008, An effective and robust method for short text classification, 1444
Cohen, 1988
Costa, E., Lorena, A., Carvalho, A., & Freitas, A. (2007). A review of performance evaluation measures for hierarchical classifiers. In Proceedings of the AAAI 2007 workshop “Evaluation methods for machine learning” (pp. 1–6).
Demsar, 2006, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research, 7, 1
Dietterich, 1998, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation, 10, 1895, 10.1162/089976698300017197
Duda, 1973
Eisner, R., Poulin, B., Szafron, D., Lu, P., & Greiner, R. (2005). Improving protein function prediction using the hierarchical structure of the gene ontology. In Proceedings of IEEE symposium on computational intelligence in bioinformatics and computational biology (pp. 1–10).
Gamon, M., Aue, A., Corston-Oliver, S., & Ringger, E. (2005). Pulse: Mining customer opinions from free text. In Proceedings of the 6th international symposium on intelligent data analysis (IDA 2005) (pp. 121–132).
Goutte, C., & Gaussier, E. (2005). A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In Proceedings of 27th European conference on IR research (ECIR 2005) (pp. 345–359).
Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1997). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR-97) (pp. 192–201).
Huang, J., & Ling, C. (2007). Constructing new and better evaluation measures for machine learning. In Proceedings of the 20th international joint conference on artificial intelligence (IJCAI’2007) (pp. 859–864).
Isselbacher, 1994
Japkowicz, N. (2006). Why question machine learning evaluation methods? In Proceedings of the AAAI’06 workshop on evaluation methods for machine learning (pp. 6–11).
Kazawa, H., Izumitani, T., Taira, H., & Maeda, E. (2005). Maximal margin labeling for multi-topic text categorization. In Advances in neural information processing systems (NIPS’04) (Vol. 17, pp. 649–656).
Kiritchenko, S., Matwin, S., Nock, R., & Famili, A. F. (2006). Learning and evaluation in the presence of class hierarchies: Application to text categorization. In Proceedings of the 19th Canadian conference on AI (AI’2006) (pp. 395–406).
Lachiche, N., & Flach, P. A. (2003). Improving accuracy and cost of two-class and multi-class probabilistic classifiers using ROC curves. In Proceedings of ICML’2003 (pp. 416–423).
Langley, 1996
Li, T., Zhang, C., & Zhu, S. (2006). Empirical studies on multi-label classification. In Proceedings of the 18th IEEE international conference on tools with artificial intelligence (pp. 86–92).
Marchand, 2002, The set covering machine, Journal of Machine Learning Research, 3, 723
Mewes, 1997, MIPS: A database for protein sequences, homology data and yeast genome information, Nucleic Acids Research, 25, 28, 10.1093/nar/25.1.28
Mitchell, 1997
Nigam, 2004, Towards a robust metric of opinion, 98
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of empirical methods of natural language processing (EMNLP’02) (pp. 79–86).
Rijsbergen, 1979
Salton, 1993
Salzberg, 1999, On comparing classifiers: A critique of current research and methods, Data Mining and Knowledge Discovery, 1, 1
Sasaki, 2007, Multi-topic aspects in clinical text classification, 62
Sebastiani, 2002, Machine learning in automated text categorization, ACM Computing Surveys, 34, 1, 10.1145/505282.505283
Shawe-Taylor, 2004
Snyder, B., & Barzilay, R. (2007). Database-text alignment via structured multilabel classification. In Proceedings of the international joint conference on artificial intelligence (IJCAI-2007) (pp. 1713–1718).
Sokolova, M., Japkowicz, N., & Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In Proceedings of the ACS Australian joint conference on artificial intelligence (pp. 1015–1021).
Sokolova, M., & Lapalme, G. (2007). Performance measures in classification of human communication. In Proceedings of the 20th Canadian conference on artificial intelligence (AI’2007) (pp. 159–170).
Sun, 2003, Performance measurement framework for hierarchical text classification, Journal of the American Society for Information Science and Technology, 54, 1014, 10.1002/asi.10298
Tan, 2004, Selecting the right objective measure for association analysis, Information Systems, 29, 293, 10.1016/S0306-4379(03)00072-3
Tikk, D., & Biró, G. (2003). Experiments with multi-label text classifier on the Reuters collection. In Proceedings of the international conference on computational cybernetics (ICCC 03) (pp. 33–38).
Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proceedings of the 2006 conference on empirical methods in natural language processing (pp. 327–335).
Wang, 1996, Beyond accuracy: What data quality means to data consumers, Journal of Management Information Systems, 12, 5, 10.1080/07421222.1996.11518099
Wilson, 2006, Recognizing strong and weak opinion clauses, Computational Intelligence, 22, 73, 10.1111/j.1467-8640.2006.00275.x
Yang, 1999, An evaluation of statistical approaches to text categorization, Information Retrieval, 1, 69, 10.1023/A:1009982220290
Youden, 1950, Index for rating diagnostic tests, Cancer, 3, 32, 10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3
Zhu, S., Ji, X., Xu, W., & Gong, Y. (2005). Multi-labelled classification using maximum entropy method. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 274–281).