Preserving confidentiality of high-dimensional tabulated data: Statistical and computational issues
Tóm tắt
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
Tài liệu tham khảo
Bishop Y.M.M., Fienberg S.E., and Holland P.W. 1975. Discrete Multivariate Analyses: Theory and Practice. MIT Press, Cambridge, MA.
Buzzigoli L. and Giusti A. 1999.Analgorithm to calculate the lower and upper bounds of the elements of an array given its marginals. In: Statistical Data Protection (SDP'98) Proceedings, Luxembourg, Eurostat. pp. 131–147.
Cormen T.H., Leiserson C.E., and Rivest R.L. 1990. Introduction to Algorithms. MIT Press/McGraw-Hill.
Dobra A. 2002. Statistical Tools for Disclosure Limitation in Multiway Contingency Tables. PhD thesis, Department of Statistics, Carnegie Mellon University.
Dobra A. and FienbergS.E. 2000. Bounds for cell entries in contingency tables given marginal totals and decomposable graphs. Proc. Nat. Acad. Sciences 97: 11885–11892.
Dobra A. and Fienberg S.E. 2001. Bounds for cell entries in contingency tables induced by fixed marginal totals with applications to disclosure limitation. Statist. J. United Nations ECE 18: 363–371, 2001. Presented at the 2nd Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, 14-16 March, Skopje, Macedonia.
Dobra A., Karr A.F., Fienberg S.E., and Sanil A.P. 2002. Software systems for tabular data releases. Int. J. Uncertainty, Fuzziness and Knowledge Based Systems 10(5): 529–544.
Fienberg S.E. 1999. Fréchet and bonferroni bounds for multi-way tables of counts with applications to dis-closure limitation. In: Statistical Data Protection (SDP'98) Proceedings, Luxembourg, Eurostat. pp. 115–129.
Harinarayan V., Rajaraman A., and Ullman, J.D. 1996. Implementing data cubes efficiently. In: Proceedings of theACMSIGMODInternational Conference on Management of Data, Vol. 25(2) of ACM SIGMOD Record, June 4-6, ACM Press, New York, pp. 205–216.
Karr A.F., Dobra A., and Sanil A.P. 2003. Table servers protect confidentiality in tabular data releases. Comm. ACM 46(1): 57–58.
Knuth D.E. 1997. Fundamental Algorithms, Vol. 1 of The Art of Computer Programming, 3rd edn. Addison-Wesley, Reading, Massachusetts.
Lauritzen S.L. 1996. Graphical Models. Clarendon Press, Oxford, UK.
Madigan D. and York J. 1995. Bayesian graphical models for discrete data. Internat. Statis. Rev. 63: 215–232.
Moore A.W. and Lee M.S. 1998. Cached sufficient statistics for efficient machine learning with large datasets. J. Artificial Intell. Res. 8: 67–91.
Bjarne Stroustrup. 1997. The C++ Programming Language, 3rd edn. Addison-Wesley, Reading, MA.
Sun Microsystems. Java programming language. http://java.sun.com, 2002.
Willenborg L.C.R.J. and de Waal T. 2001. Elements of Statistical Disclosure Control. Springer-Verlag, New York.