Dealing with Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation

J. A. Martín-Fernández, C. Barceló-Vidal, V. Pawlowsky-Glahn
Dept. Informàtica i Matemàtica Aplicada, Universitat de Girona, Girona, Spain

Abstract

Statistical analysis of compositional data based on logratios of parts cannot be applied directly when a data set contains zeros, because the logarithm of zero is undefined. Several strategies have nevertheless been published in the specialized literature that allow this modeling approach to be used. In particular, substitution or imputation strategies are available for rounded zeros. This paper reviews the existing nonparametric imputation methods, both additive and multiplicative, and gives the essential properties of the multiplicative method. For missing values, a generalization of the multiplicative approach is proposed.
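The abstract does not spell out the multiplicative method, so the following is only a minimal sketch of the form in which multiplicative replacement of rounded zeros is usually stated: each zero part is set to a small value and the nonzero parts are shrunk by a common factor so that the composition still sums to its constant. The function name `multiplicative_replacement`, the use of a single `delta` for every zero part, and the example values are assumptions made for illustration, not taken from the paper.

```python
from typing import List, Sequence


def multiplicative_replacement(
    x: Sequence[float], delta: float, total: float = 1.0
) -> List[float]:
    """Replace zeros in a composition by `delta` (hypothetical helper).

    Nonzero parts are multiplied by a common factor so that the result
    still sums to `total`. Assumes the same `delta` for every zero part.
    """
    n_zeros = sum(1 for part in x if part == 0)
    # Common shrink factor applied to every nonzero part.
    shrink = 1.0 - n_zeros * delta / total
    if shrink <= 0:
        raise ValueError("delta is too large for the number of zeros")
    return [delta if part == 0 else part * shrink for part in x]


# Illustrative four-part composition with one rounded zero.
composition = [0.6, 0.3, 0.1, 0.0]
replaced = multiplicative_replacement(composition, delta=0.005)
```

Because every nonzero part is scaled by the same factor, ratios between nonzero parts (here 0.6 / 0.3) are unchanged by the replacement, which is the kind of property a logratio analysis relies on.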
