Detection of Outliers in Geochemical Data Using Ensembles of Subsets of Variables
Abstract
Geochemical data used in the geological interpretation of mine deposits and the identification of geological domains often contain outliers. Making statistically sound and robust decisions about outliers (such as deciding whether the observations under consideration belong to a given domain) can be challenging. Traditional statistical procedures are often poorly suited to the noisy, intrinsically multivariate and high-dimensional nature of geochemical data. We present a novel approach for robustly detecting outliers in large multi-dimensional geochemical data. The approach incorporates a feature selection method that automatically seeks the subset of chemical ratios that, together with the original chemical variables, best represents the inherent characteristics of the data. The proposed approach robustly distinguishes outliers even at high contamination levels. Experimental results using data from an iron ore deposit located in the Brockman Iron Formation of the Hamersley Province, Western Australia, demonstrate the advantages of the proposed feature selection algorithm over previous methods used in outlier detection.
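To make the general idea concrete, the following is a minimal sketch of how outlier votes might be pooled across the original variables and pairwise chemical ratios. It is not the authors' algorithm: the abstract does not specify the feature selection or the detector, so this sketch substitutes a simple univariate robust z-score (median and MAD) on each log-transformed variable and each pairwise log-ratio, and performs no feature selection at all. The function names, the threshold of 3.5, and the assay values are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations
from statistics import median


def robust_z(values):
    """Absolute robust z-scores based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return [0.0 for _ in values]
    return [abs(v - med) / scale for v in values]


def ensemble_outlier_votes(rows, threshold=3.5):
    """Fraction of features that flag each row as an outlier.

    Features are the log of every variable plus every pairwise log-ratio.
    All values must be strictly positive. The threshold is an assumption.
    """
    n_vars = len(rows[0])
    features = [[math.log(r[j]) for r in rows] for j in range(n_vars)]
    for i, j in combinations(range(n_vars), 2):
        features.append([math.log(r[i] / r[j]) for r in rows])

    votes = [0] * len(rows)
    for feature in features:
        for k, z in enumerate(robust_z(feature)):
            if z > threshold:
                votes[k] += 1
    return [v / len(features) for v in votes]


# Synthetic assays (Fe, SiO2, Al2O3 in wt%); the last row is a planted outlier.
rows = [
    [62.0, 4.0, 2.0],
    [61.5, 4.2, 2.1],
    [62.3, 3.9, 1.9],
    [61.8, 4.1, 2.0],
    [62.1, 4.0, 2.05],
    [61.9, 4.3, 2.2],
    [30.0, 45.0, 1.0],
]
scores = ensemble_outlier_votes(rows)
for row, score in zip(rows, scores):
    print(row, round(score, 2))
```

A sample flagged by many features receives a score close to 1. In practice a multivariate robust estimator and a principled choice of ratio subsets would replace the per-feature vote used here.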