Detection of Outliers in Geochemical Data Using Ensembles of Subsets of Variables

Mathematical Geosciences - Volume 50 - Pages 369-380 - 2017
Mehala Balamurali (1), Arman Melkumyan (1)
(1) Australian Centre for Field Robotics, The University of Sydney, Sydney, Australia

Abstract

Geochemical data used in geological interpretation of mine deposits and identification of geological domains often contain outliers. Undertaking statistically sound and robust decision-making about outliers (such as deciding whether observations under consideration belong to a given domain) can be a challenging task. Traditional statistical procedures are often poorly suited to the noisy, intrinsically multivariate and high-dimensional nature of geochemical data. We present herein a novel approach for detecting outliers robustly in large multi-dimensional geochemical data. The approach incorporates a feature selection method that automatically seeks the best subset of chemical ratios that, together with the original chemical variables, best represent the inherent characteristics of the data. The proposed approach robustly distinguishes outliers even at high contamination levels. Experimental results demonstrating the advantages of the proposed feature selection algorithm over previous methods used in outlier detection are shown using data from an iron ore deposit located in the Brockman Iron Formation of Hamersley Province, Western Australia.
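
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of combining outlier decisions from several subsets of variables. It is not the authors' method. It assumes that each subset is a pair of chemical variables summarised by its log-ratio, that each log-ratio is scored with a robust z-score (median and MAD), and that a sample is flagged when at least half of the pairs vote for it. The function names, the thresholds, and the assay values (columns imagined as Fe, SiO2, Al2O3 percentages) are all hypothetical and chosen purely for illustration.

```python
import math
from itertools import combinations
from statistics import median


def robust_z(values):
    """Robust z-scores based on the median and the scaled MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        # Degenerate case: no spread, so nothing can be called anomalous.
        return [0.0 for _ in values]
    return [(v - med) / scale for v in values]


def ensemble_outliers(samples, threshold=3.5, min_vote_fraction=0.5):
    """Flag samples that enough variable pairs consider anomalous.

    samples: list of equal-length tuples of strictly positive values.
    threshold and min_vote_fraction are illustrative defaults (assumptions).
    """
    n_vars = len(samples[0])
    pairs = list(combinations(range(n_vars), 2))
    votes = [0] * len(samples)
    for i, j in pairs:
        # One ensemble member: the log-ratio of a single pair of variables.
        log_ratios = [math.log(s[i] / s[j]) for s in samples]
        for k, z in enumerate(robust_z(log_ratios)):
            if abs(z) > threshold:
                votes[k] += 1
    return [v / len(pairs) >= min_vote_fraction for v in votes]


# Hypothetical assays: six similar samples and one silica-rich sample.
samples = [
    (62.0, 4.0, 2.0),
    (61.5, 4.2, 2.1),
    (62.5, 3.9, 1.9),
    (61.0, 4.1, 2.0),
    (62.2, 4.0, 2.05),
    (61.8, 4.3, 2.2),
    (30.0, 45.0, 2.0),
]
flags = ensemble_outliers(samples)
print(flags)
```

With these made-up values only the last sample is flagged, because all three pairwise log-ratios place it far from the median. Median and MAD are used here because they stay stable when a sizeable share of the data is contaminated, which is the property the abstract emphasises; a fuller treatment would likely use a multivariate robust estimator rather than one ratio at a time.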
