Knowledge discovery from imbalanced and noisy data
Abstract
Keywords
References
A. Asuncion, D. Newman, UCI machine learning repository <http://www.ics.uci.edu/~mlearn/MLRepository.html>, 2007.
R. Barandela, R.M. Valdovinos, J.S. Sanchez, F.J. Ferri, The imbalanced training sample problem: under or over sampling? in: Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition (SSPR/SPR’04), Lecture Notes in Computer Science 3138, 2004, pp. 806–814.
Batista, 2004, A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations Newsletter, 6, 20, 10.1145/1007730.1007735
Berenson, 1983
C.E. Brodley, M.A. Friedl, Identifying and eliminating mislabeled training instances, in: Proceedings of 13th National Conference on Artificial Intelligence, AAAI Press, 1996, pp. 799–805.
Brodley, 1999, Identifying mislabeled training data, Journal of Artificial Intelligence Research, 11, 131, 10.1613/jair.606
Cao, 2008, Mining impact-targeted activity patterns in imbalanced data, IEEE Transactions on Knowledge and Data Engineering, 20, 1053, 10.1109/TKDE.2007.190635
Chawla, 2008, Automatically countering imbalance and its empirical relationship to cost, Data Mining and Knowledge Discovery, 17, 225, 10.1007/s10618-008-0087-0
Chawla, 2002, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321, 10.1613/jair.953
C. Drummond, R.C. Holte, C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in: Workshop on Learning from Imbalanced Data Sets II, International Conference on Machine Learning, 2003.
W. Fan, S.J. Stolfo, J. Zhang, P.K. Chan, AdaCost: misclassification cost-sensitive boosting, in: Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1999, pp. 97–105.
Y. Feng, Z. Wu, Z. Zhou, Enhancing reliability throughout knowledge discovery process, in: Sixth IEEE International Conference on Data Mining – Reliability Issues in Knowledge Discovery Workshop (RIKD06), 2006, pp. 754–758.
Fenton, 1997
A. Folleco, T.M. Khoshgoftaar, J. Van Hulse, L. Bullard, Identifying learners robust to low quality data, in: Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI 2008), Las Vegas, NV, 2008, pp. 190–195.
D. Gamberger, N. Lavrač, C. Groselj, Experiments with noise filtering in a medical domain, in: Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann, 1999, pp. 143–153.
H. Han, W.Y. Wang, B.H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing (ICIC’05). Lecture Notes in Computer Science 3644, Springer-Verlag, 2005, pp. 878–887.
Hand, 2005, Good practice in retail credit scorecard assessment, Journal of the Operational Research Society, 56, 1109, 10.1057/palgrave.jors.2601932
N. Japkowicz, Learning from imbalanced data sets: a comparison of various strategies, in: AAAI Workshop on Learning from Imbalanced Data Sets (AAAI’00), 2000, pp. 10–15.
N. Japkowicz, Class imbalances: are we focusing on the right issue? in: Workshop on Learning from Imbalanced Datasets II, ICML, 2003.
Japkowicz, 2002, The class imbalance problem: a systematic study, Intelligent Data Analysis, 6, 429, 10.3233/IDA-2002-6504
Jo, 2004, Class imbalances versus small disjuncts, SIGKDD Explorations, 6, 40, 10.1145/1007730.1007737
M.V. Joshi, V. Kumar, R.C. Agarwal, Evaluating boosting algorithms to classify rare classes: comparison and improvements, in: Proceedings of IEEE International Conference on Data Mining, November 2001, pp. 257–264.
Khoshgoftaar, 1998, Classification of fault-prone software modules: prior probabilities, costs and model evaluation, Empirical Software Engineering, 3, 275, 10.1023/A:1009736205722
T.M. Khoshgoftaar, C. Seiffert, J. Van Hulse, A. Napolitano, A. Folleco, Learning with limited minority class data, in: Proceedings of the Sixth IEEE International Conference on Machine Learning and Applications (ICMLA’07), IEEE Computer Society, Cincinnati, OH, 2007, pp. 348–353.
Khoshgoftaar, 2004, Comparative assessment of software quality classification techniques: an empirical case study, Empirical Software Engineering Journal, 9, 229, 10.1023/B:EMSE.0000027781.18360.9b
T.M. Khoshgoftaar, N. Seliya, The necessity of assuring quality in software measurement data, in: Proceedings of 10th International Software Metrics Symposium, IEEE Computer Society, Chicago, IL, September 2004, pp. 119–130.
Khoshgoftaar, 2005, Detecting noisy instances with the rule-based classification model, Intelligent Data Analysis: An International Journal, 9, 347, 10.3233/IDA-2005-9403
Khoshgoftaar, 2005, Enhancing software quality estimation using ensemble-classifier based noise filtering, Intelligent Data Analysis: An International Journal, 9, 3, 10.3233/IDA-2005-9102
M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one sided selection, in: Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann, 1997, pp. 179–186.
Li, 2008, Fuzzy relevance vector machine for learning from unbalanced data and noise, Pattern Recognition Letters, 29, 1175, 10.1016/j.patrec.2008.01.009
Little, 2002
M. Maloof, Learning when data sets are imbalanced and when costs are unequal and unknown, in: Proceedings of the ICML’03 Workshop on Learning from Imbalanced Data Sets, 2003.
R. Prati, G. Batista, M. Monard, Learning with class skews and small disjuncts, in: XVIIth Brazilian Symposium on Artificial Intelligence (SBIA’04). Lecture Notes in Computer Science 3171, Springer-Verlag, 2004, pp. 296–306.
F. Provost, T. Fawcett, R. Kohavi, The case against accuracy estimation for comparing induction algorithms, in: Proceedings of the 15th International Conference on Machine Learning (ICML-98), 1998.
S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large data sets, in: Proceedings of ACM SIGMOD Conference on Management of Data, ACM, 2000, pp. 427–438.
SAS Institute, SAS/STAT User’s Guide. SAS Institute Inc., 2004.
Sun, 2007, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition, 40, 3358, 10.1016/j.patcog.2007.04.009
C.M. Teng, Correcting noisy data, in: Proceedings of the Sixteenth International Conference on Machine Learning (ICML 99), Morgan Kaufmann, 1999, pp. 239–248.
Thomas, 2002, Credit scoring and its applications, SIAM Monographs on Mathematical Modeling and Computation
Van Hulse, 2006, Class noise detection using frequent itemsets, Intelligent Data Analysis: An International Journal, 10, 487, 10.3233/IDA-2006-10602
J. Van Hulse, T.M. Khoshgoftaar, H. Huang, The pairwise attribute noise detection algorithm. Knowledge and Information Systems Journal, Special Issue on Mining Low Quality Data 11(2) (2007) 171–190.
J. Van Hulse, T.M. Khoshgoftaar, A. Napolitano, Experimental perspectives on learning from imbalanced data, in: Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), Corvallis, OR, June 2007, pp. 935–942.
J. Van Hulse, T.M. Khoshgoftaar, A. Napolitano, Skewed class distributions and mislabeled examples, in: Proceedings of the Seventh IEEE International Conference on Data Mining – Workshops (ICDM’07), Omaha, NE, 2007, pp. 477–482.
G. Weiss, K. McCarthy, B. Zabar, Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? in: Proceedings of the 2007 International Conference on Data Mining, CSREA Press, Las Vegas, NV, USA, 2007, pp. 35–41.
Weiss, 2004, Mining with rarity: a unifying framework, SIGKDD Explorations, 6, 7, 10.1145/1007730.1007734
Weiss, 2003, Learning when training data are costly: the effect of class distribution on tree induction, Journal of Artificial Intelligence Research, 19, 315, 10.1613/jair.1199
Witten, 2005
Wohlin, 2000
Zhu, 2004, Class noise vs. attribute noise: a quantitative study of their impacts, Artificial Intelligence Review, 22, 177, 10.1007/s10462-004-0751-8
X. Zhu, X. Wu, Cost-guided class noise handling for effective cost-sensitive learning, in: Fourth IEEE International Conference on Data Mining (ICDM 2004), November 2004, pp. 297–304.