Class Noise vs. Attribute Noise: A Quantitative Study

Xingquan Zhu1, Xindong Wu1
1Department of Computer Science, University of Vermont, Burlington, USA 05405

Abstract

Keywords


References

Allison, P.D. (2002). Missing Data. Thousand Oaks, CA: Sage.

Bansal, N., Chawla, S. & Gupta, A. (2000). Error Correction in Noisy Datasets Using Graph Mincuts. Project Report, Carnegie Mellon University, http://www.cs.cmu.edu/15781/web/proj/chawla.ps.

Batista, G. & Monard, M.C. (2003). An Analysis of Four Missing Data Treatment Methods for Supervised Learning. Applied Artificial Intelligence 17:519–533.

Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/mlearn/MLRepository.html

Brodley, C.E. & Friedl, M.A. (1996). Identifying and Eliminating Mislabeled Training Instances. Proc. of 13th National Conf. on Artificial Intelligence, 799–805.

Brodley, C.E. & Friedl, M.A. (1999). Identifying Mislabeled Training Data. Journal of Artificial Intelligence Research 11:131–167.

Bruha, I. & Franek, F. (1996). Comparison of Various Routines for Unknown Attribute Value Processing: the Covering Paradigm. International Journal of Pattern Recognition and Artificial Intelligence 10 (8):939–955.

Bruha, I. (2002). Unknown Attributes Values Processing by Meta-learner. Foundations of Intelligent Systems, 13th International Symposium, 451–461.

Cendrowska, J. (1987). Prism: An Algorithm for Inducing Modular Rules. International Journal of Man-Machine Studies 27:349–370.

Clark, P. & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning 3 (4):261–283.

Cohen, J. & Cohen, P. (1983). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (2nd ed.), Hillsdale, NJ: Erlbaum.

Davé, R. (1991). Characterization and Detection of Noise in Clustering. Pattern Recognition Letters 12:657–664.

Domingos, P. & Pazzani, M. (1996). Beyond Independence: Conditions for the Optimality of Simple Bayesian Classifier. In Proceedings of the 13th International Conference on Machine Learning, pp. 105–112.

Everitt, B.S. (1977). The Analysis of Contingency Tables. Chapman and Hall.

Freitas, A. (2001). Understanding the Crucial Role of Attribute Interactions in Data Mining. Artificial Intelligence Review 16 (3):177–199.

Gamberger, D., Lavrac, N. & Groselj, C. (1999). Experiments with Noise Filtering in a Medical Domain. Proc. of 16th ICML Conference, San Francisco, CA, 143–151.

Gamberger, D., Lavrac, N. & Dzeroski, S. (2000). Noise Detection and Elimination in Data Preprocessing: Experiments in Medical Domains. Applied Artificial Intelligence 14:205–223.

Guyon, I., Matic, N. & Vapnik, V. (1996). Discovering Informative Patterns and Data Cleaning. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, pp. 181–203.

Hickey, R. (1996). Noise Modeling and Evaluating Learning from Examples. Artificial Intelligence 82 (1–2):157–179.

Holte, R.C. (1993). Very Simple Classification Rules Perform well on Most Commonly Used Datasets. Machine Learning 11:1993.

Hoppner, F. (2003). A Biography Index of References Related to Noise Handling, http://public.rz.fhwolfenbuettel.de/hoeppnef/bib/keyword/NOISE-HANDLING.html

Howell, D.C. (2002). Treatment of Missing Data, Technical Report, University of Vermont, http://www.uvm.edu/dhowell/StatPages/More_Stu./Missing_Data/Missing.html

Huang, C. & Lee, H. (2001). A grey-based Nearest Neighbor Approach for Predicting Missing Attribute Values. Proc. of 2001 National Computer Symposium, Taiwan, NSC-90-2213-E-011-052.

Hunt, E.B., Martin, J. & Stone, P. (1966). Experiments in Induction. New York: Academic Press.

IBM Synthetic Data. IBM Almaden Research, Synthetic classification data generator, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html#classSynData

John, G.H. (1995). Robust Decision Trees: Removing Outliers from Databases. Proc. of the First International Conference on Knowledge Discovery and Data Mining. AAAI Press, pp. 174–179.

Kubica, J. & Moore, A. (2003). Probabilistic Noise Identification and Data Cleaning. Proceedings of Third IEEE International Conference on Data Mining, Florida.

Langley, P., Iba, W. & Thompson, K. (1992). An Analysis of Bayesian Classifiers. Proceedings of AAAI-92, 223–228.

Little, R.J.A. & Rubin, D.B. (1987). Statistical Analysis with Missing Data. Wiley: New York.

Maletic, J. & Marcus, A. (2000). Data Cleansing: Beyond Integrity Analysis. Proceedings of the Conference on Information Quality (IQ2000).

Oka, N. & Yoshida, K. (1993). Learning regular and irregular examples separately. Proc. of IEEE International Joint Conference on Neural Networks, 171–174.

Oka, N. & Yoshida, K. (1996). A noise-tolerant hybrid model of a global and a local learning model. Proc. of AAAI-96 Workshop: Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithm, 95–100.

Orr, K. (1998). Data Quality and Systems Theory. CACM 41 (2):66–71.

Quinlan, J.R. (1983). Learning from Noisy Data. Proceedings of the Second International Machine Learning Workshop, University of Illinois at Urbana-Champaign.

Quinlan, J.R. (1986a). Induction of Decision Trees. Machine Learning 1 (1):81–106.

Quinlan, J.R. (1986b). The Effect of Noise on Concept Learning. In Michalski, R.S., Carbonell, J.G. & Mitchell, T.M. (eds.), Machine Learning, Morgan Kaufmann.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Quinlan, J.R. (1989). Unknown Attribute Values in Induction. Proceedings of 6th International Workshop on Machine Learning, 164–168.

Ragel, A. & Cremilleux, B. (1999). MVC - a preprocessing method to Deal with Missing Values. Knowledge-Based Systems, 285–291.

Redman, T. (1998). The Impact of Poor Data Quality on the Typical Enterprise. CACM 41 (2):79–82.

Redman, T. (1996). Data Quality for the Information Age. Artech House.

Schaffer, C. (1992). Sparse Data and the Effect of Overfitting Avoidance in Decision Tree Induction. Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI), San Jose, CA. pp. 147–152.

Schaffer, C. (1993). Overfitting Avoidance as Bias. Machine Learning 10:153–178.

Srinivasan, A., Muggleton, S. & Bain, M. (1992). Distinguishing Exception from Noise in Non-monotonic Learning. Proc. of 2nd Inductive Logic Programming Workshop, pp. 97–107.

Teng, M. (1999). Correcting Noisy Data. Proceedings of the Sixteenth International Conference on Machine Learning, pp. 239–248.

Wang, R., Storey, V. & Firth, C. (1995). A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering 7 (4):623–639.

Wang, R., Strong, D. & Guarascio, L. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems 12 (4):5–34.

Weisberg, S. (1980). Applied Linear Regression. John Wiley and Sons, Inc.

Wu, X. (1995). Knowledge Acquisition from Databases. Ablex Publishing Corp.

Zhao, Q. & Nishida, T. (1995). Using Qualitative Hypotheses to Identify Inaccurate Data. Journal of Artificial Intelligence Research 3, pp. 119–145.

Zhu, X., Wu, X. & Chen, S. (2003a). Eliminating class noise in large datasets. Proceedings of the 20th ICML International Conference on Machine Learning, Washington D.C. pp. 920–927.

Zhu, X., Wu, X. & Chen, Q. (2003b). Identifying Class Noise in Large, Distributed Datasets. Technical Report, University of Vermont, http://www.cs.uvm.edu/tr/CS-03-12.shtml.

Zhu, X., Wu, X. & Yang, Y. (2004). Error Detection and Impact-sensitive Instance Ranking in Noisy Datasets. In Proceedings of 19th National Conference on Artificial Intelligence (AAAI-2004), San Jose, CA.