Knowledge discovery from imbalanced and noisy data

Data and Knowledge Engineering - Volume 68, Issue 12 - Pages 1513-1542 - 2009

Jason Van Hulse¹, Taghi M. Khoshgoftaar¹

¹Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431, United States

References

A. Asuncion, D. Newman, UCI machine learning repository <http://www.ics.uci.edu/~mlearn/MLRepository.html>, 2007.

R. Barandela, R.M. Valdovinos, J.S. Sanchez, F.J. Ferri, The imbalanced training sample problem: under or over sampling? In Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition (SSPR/SPR’04), Lecture Notes in Computer Science 3138, 2004, pp. 806–814.

Batista, 2004, A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explorations Newsletter, 6, 20, 10.1145/1007730.1007735

Berenson, 1983

Breiman, 2001, Random forests, Machine Learning, 45, 5, 10.1023/A:1010933404324

C.E. Brodley, M.A. Friedl, Identifying and eliminating mislabeled training instances, in: Proceedings of 13th National Conference on Artificial Intelligence, AAAI Press, 1996, pp. 799–805.

Brodley, 1999, Identifying mislabeled training data, Journal of Artificial Intelligence Research, 11, 131, 10.1613/jair.606

Cao, 2008, Mining impact-targeted activity patterns in imbalanced data, IEEE Transactions on Knowledge and Data Engineering, 20, 1053, 10.1109/TKDE.2007.190635

Chawla, 2008, Automatically countering imbalance and its empirical relationship to cost, Data Mining and Knowledge Discovery, 17, 225, 10.1007/s10618-008-0087-0

Chawla, 2002, SMOTE: synthetic minority oversampling technique, Journal of Artificial Intelligence Research, 16, 321, 10.1613/jair.953

C. Drummond, R.C. Holte, C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in: Workshop on Learning from Imbalanced Data Sets II, International Conference on Machine Learning, 2003.

W. Fan, S.J. Stolfo, J. Zhang, P.K. Chan, AdaCost: misclassification cost-sensitive boosting, in: Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1999, pp. 97–105.

Y. Feng, Z. Wu, Z. Zhou, Enhancing reliability throughout knowledge discovery process, in: Sixth IEEE International Conference on Data Mining – Reliability Issues in Knowledge Discovery Workshop (RIKD06), 2006, pp. 754–758.

Fenton, 1997

A. Folleco, T.M. Khoshgoftaar, J. Van Hulse, L. Bullard, Identifying learners robust to low quality data, in: Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI 2008), Las Vegas, NV, 2008, pp. 190–195.

D. Gamberger, N. Lavrač, C. Groselj, Experiments with noise filtering in a medical domain, in: Proceedings of the 16th International Conference on Machine Learning, Morgan Kaufmann, 1999, pp. 143–153.

H. Han, W.Y. Wang, B.H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: International Conference on Intelligent Computing (ICIC’05). Lecture Notes in Computer Science 3644, Springer-Verlag, 2005, pp. 878–887.

Hand, 2005, Good practice in retail credit scorecard assessment, Journal of the Operational Research Society, 56, 1109, 10.1057/palgrave.jors.2601932

N. Japkowicz, Learning from imbalanced data sets: a comparison of various strategies, in: AAAI Workshop on Learning from Imbalanced Data Sets (AAAI’00), 2000, pp. 10–15.

N. Japkowicz, Class imbalances: are we focusing on the right issue? In Workshop on Learning from Imbalanced Datasets II, ICML, 2003.

Japkowicz, 2002, The class imbalance problem: a systematic study, Intelligent Data Analysis, 6, 429, 10.3233/IDA-2002-6504

Jo, 2004, Class imbalances versus small disjuncts, SIGKDD Explorations, 6, 40, 10.1145/1007730.1007737

M.V. Joshi, V. Kumar, R.C. Agarwal, Evaluating boosting algorithms to classify rare classes: comparison and improvements, in: Proceedings of IEEE International Conference on Data Mining, November 2001, pp. 257–264.

Khoshgoftaar, 1998, Classification of fault-prone software modules: prior probabilities, costs and model evaluation, Empirical Software Engineering, 3, 275, 10.1023/A:1009736205722

T.M. Khoshgoftaar, C. Seiffert, J. Van Hulse, A. Napolitano, A. Folleco, Learning with limited minority class data, in: Proceedings of the Sixth IEEE International Conference on Machine Learning and Applications (ICMLA’07), IEEE Computer Society, Cincinnati, OH, 2007, pp. 348–353.

Khoshgoftaar, 2004, Comparative assessment of software quality classification techniques: an empirical case study, Empirical Software Engineering Journal, 9, 229, 10.1023/B:EMSE.0000027781.18360.9b

T.M. Khoshgoftaar, N. Seliya, The necessity of assuring quality in software measurement data, in: Proceedings of 10th International Software Metrics Symposium, IEEE Computer Society, Chicago, IL, September 2004, pp. 119–130.

Khoshgoftaar, 2005, Detecting noisy instances with the rule-based classification model, Intelligent Data Analysis: An International Journal, 9, 347, 10.3233/IDA-2005-9403

Khoshgoftaar, 2005, Enhancing software quality estimation using ensemble-classifier based noise filtering, Intelligent Data Analysis: An International Journal, 9, 3, 10.3233/IDA-2005-9102

M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one sided selection, in: Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann, 1997, pp. 179–186.

Li, 2008, Fuzzy relevance vector machine for learning from unbalanced data and noise, Pattern Recognition Letters, 29, 1175, 10.1016/j.patrec.2008.01.009

Little, 2002

M. Maloof, Learning when data sets are imbalanced and when costs are unequal and unknown, in: Proceedings of the ICML’03 Workshop on Learning from Imbalanced Data Sets, 2003.

R. Prati, G. Batista, M. Monard, Learning with class skews and small disjuncts, in: XVIIth Brazilian Symposium on Artificial Intelligence (SBIA’04). Lecture Notes in Computer Science 3171, Springer-Verlag, 2004, pp. 296–306.

F. Provost, T. Fawcett, R. Kohavi, The case against accuracy estimation for comparing induction algorithms, in: Proceedings of the 15th International Conference on Machine Learning (ICML-98), 1998.

S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large data sets, in: Proceedings of ACM SIGMOD Conference on Management of Data, ACM, 2000, pp. 427–438.

SAS Institute, SAS/STAT User’s Guide. SAS Institute Inc., 2004.

Sun, 2007, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition, 40, 3358, 10.1016/j.patcog.2007.04.009

C.M. Teng, Correcting noisy data, in: Proceedings of the Sixteenth International Conference on Machine Learning (ICML 99), Morgan Kaufmann, 1999, pp. 239–248.

Thomas, 2002, Credit scoring and its applications, SIAM Monographs on Mathematical Modeling and Computation

Van Hulse, 2006, Class noise detection using frequent itemsets, Intelligent Data Analysis: An International Journal, 10, 487, 10.3233/IDA-2006-10602

J. Van Hulse, T.M. Khoshgoftaar, H. Huang, The pairwise attribute noise detection algorithm. Knowledge and Information Systems Journal, Special Issue on Mining Low Quality Data 11(2) (2007) 171–190.

J. Van Hulse, T.M. Khoshgoftaar, A. Napolitano, Experimental perspectives on learning from imbalanced data, in: Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), Corvallis, OR, June 2007, pp. 935–942.

J. Van Hulse, T.M. Khoshgoftaar, A. Napolitano, Skewed class distributions and mislabeled examples, in: Proceedings of the Seventh IEEE International Conference on Data Mining – Workshops (ICDM’07), Omaha, NE, 2007, pp. 477–482.

G. Weiss, K. McCarthy, B. Zabar, Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? in: Proceedings of the 2007 International Conference on Data Mining, CSREA Press, Las Vegas, NV, USA, 2007, pp. 35–41.

Weiss, 2004, Mining with rarity: a unifying framework, SIGKDD Explorations, 6, 7, 10.1145/1007730.1007734

Weiss, 2003, Learning when training data are costly: the effect of class distribution on tree induction, Journal of Artificial Intelligence Research, 19, 315, 10.1613/jair.1199

Witten, 2005

Wohlin, 2000

Zhu, 2004, Class noise vs. attribute noise: a quantitative study of their impacts, Artificial Intelligence Review, 22, 177, 10.1007/s10462-004-0751-8

X. Zhu, X. Wu, Cost-guided class noise handling for effective cost-sensitive learning, in: Fourth IEEE International Conference on Data Mining (ICDM 2004), November 2004, pp. 297–304.

L. Zhuang, H. Dai, Reducing performance bias for unbalanced text mining, in: Sixth IEEE International Conference on Data Mining – Reliability Issues in Knowledge Discovery Workshop (RIKD06), 2006, pp. 770–774.
