Maintaining bounded-size sample synopses of evolving datasets

The VLDB Journal - Tập 17 - Trang 173-201 - 2007
Rainer Gemulla1, Wolfgang Lehner1, Peter J. Haas2
1Technische Universität Dresden, Dresden, Germany
2IBM Almaden Research Center, San Jose, USA

Tóm tắt

Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. The ability to bound the maximum size of a sample can be very convenient from a system-design point of view, because the task of memory management is simplified, especially when many samples are maintained simultaneously. In this paper, we study methods for incrementally maintaining a bounded-size uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), that maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the 45-year-old reservoir sampling algorithm to handle deletions; RP reduces to the “passive” algorithm of Babcock et al. when the insertions and deletions correspond to a moving window over a data stream. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically resizing a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size. We also show how to merge uniform samples from disjoint datasets to obtain a uniform sample of the union of the datasets; the merged sample can be incrementally maintained. Our new RPMerge algorithm extends the HRMerge algorithm of Brown and Haas to effectively deal with deletions, thereby facilitating efficient parallel sampling.

Tài liệu tham khảo

Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: Proc. SODA, pp. 633–634 (2002) Brown P., Haas P., Myllymaki J., Pirahesh H., Reinwald B. and Sismanis Y. (2005). Toward automated large-scale information integration and discovery. In: Härder, T. and Lehner, W. (eds) Data Management in a Connected World, pp 161–180. Springer, Heidelberg Brown, P., Haas, P.J.: BHUNT: automatic discovery of fuzzy algebraic constraints in relational data. In: Proc. VLDB, pp. 668–679 (2003) Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: Proc. ICDE (2006) Chaudhuri, S., Motwani, R., Narasayya, V.R.: On random sampling over joins. In: Proc. ACM SIGMOD, pp. 263–274 (1999) Colt Library: Open source libraries for high performance scientific and technical computing in Java. http://dsd.lbl.gov/ hoschek/colt/ Cormode, G., Muthukrishnan, S., Rozenbaum, I.: Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In: Proc. VLDB, pp. 25–36 (2005) Fan C., Muller M. and Rezucha I. (1962). Development of sampling plans by using sequential (item by item) techniques and digital computers. J. Am. Statist. Assoc. 57: 387–402 Frahling, G., Indyk, P., Sohler, C.: Sampling in dynamic data streams and applications. In: Proc. 21st Symp. Computat. Geom., pp. 142–149 (2005) Gemulla, R., Lehner, W.: Deferred maintenance of disk-based random samples. In: Proc. EDBT, pp. 423–441 (2006) Gemulla, R., Lehner, W., Haas, P.J.: A dip in the reservoir: Maintaining sample synopses of evolving datasets. In: Proc. VLDB, pp. 595–606 (2006) Gemulla, R., Lehner, W., Haas, P.J.: Maintaining Bernoulli samples over evolving multisets. In: Proc. ACM PODS, pp. 93–102 (2007) Gibbons P., Matias Y. and Poosala V. (1997). AQUA project white paper. Tech. rep., Bell Laboratories, Murray Hill Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: Proc. ACM SIGMOD, pp. 331–342 (1998) Gibbons P.B., Matias Y. and Poosala V. (2002). Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst. 27: 182–184 GSL: GNU Scientific Library. http://www.gnu.org/software/gsl/ Haas, P., König, C.: A bi-level Bernoulli scheme for database sampling. In: Proc. ACM SIGMOD, pp. 275–286 (2004) Haas, P.J.: Data stream sampling: Basic techniques and results. In: Garofalakis, M., Gehrke, J., Rastogi, R. (eds.) Data Stream Management: Processing High Speed Data Streams, Springer, Heidelberg (2007) Halevy, A.Y., Etzioni, O., Doan, A., Ives, Z.G., Madhavan, J., McDowell, L., Tatarinov, I.: Join synopses for approximate query answering. In: Proc. CIDR (2003) Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: Proc. ACM SIGMOD, pp. 171–182 (1997) IBM Corporation: WebSphere Profile Stage User’s Manual (2005) Ilyas, I.F., Markl, V., Haas, P.J., Brown, P., Aboulnaga, A.: CORDS: automatic discovery of correlations and soft functional dependencies. In: Proc. ACM SIGMOD, pp. 647–658 (2004) Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large random samples. In: Proc. ACM SIGMOD, pp. 299–310 (2004) John, G.H., Langley, P.: Static versus dynamic sampling for data mining. In: Proc. KDD, pp. 367–370 (2005) Johnson N.L., Kotz S. and Kemp A.W. (1992). Discrete Univariate Distributions, 2nd edn. Wiley, New York Kachitvichyanukul V. and Schmeiser B. (1985). Computer generation of hypergeometric random variables. J. Stat. Comput. Simul 22: 127–145 Kivinen, J., Mannila, H.: The power of sampling in knowledge discovery. In: Proc. ACM PODS, pp. 77–85 (1994) Knuth, D.E.: The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 1st edn. Addison-Wesley, Reading (1969) Law A.M. (2007). Simulation Modeling and Analysis, 4th edn. McGraw-Hill, New York L’Ecuyer P. (2006). Uniform random number generation. In: Henderson, S.G. and Nelson, B.L. (eds) Simulation, pp 55–81. Elsevier, Amsterdam Leser, U., Naumann, F.: (Almost) hands-off information integration for the life sciences. In: Proc. CIDR, pp. 131–143 (2005) Matsumoto M. and Nishimura T. (1998). Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul. 8(1): 3–30 McLeod A.I. and Bellhouse D.R. (1983). A convenient algorithm for drawing a simple random sample. Appl. Statist. 32: 182–184 Norris J.R. (1997). Markov Chains. Cambridge University Press, Cambridge Olken, F.: Random sampling from databases. Thesis LBL-32883, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory (1993) Olken, F., Rotem, D.: Maintenance of materialized views of sampling queries. In: Proc. ICDE (1992) Poosala, V., Haas, P.J., Ioannidis, Y.E., Shekita, E.J.: Improved histograms for selectivity estimation of range predicates. In: Proc. ACM SIGMOD, pp. 294–305 (1996) Press W.H., Teukolsky S.A., Vetterling W.T. and Flannery B.P. (1992). Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge Robbins H. and Monro S. (1951). A stochastic approximation method. Ann. Math. Statist. 22: 400–407 Ross S.M. (1983). Stochastic Processes. Wiley, New York Särndal C.E., Swensson B. and Wretman J. (1992). Model Assisted Survey Sampling. Springer, Heidelberg Spall J.C. (2003). Introduction to Stochastic Search and Optimization. Wiley, New York Tatbul, N., Çetintemel, U., Zdonik, S.B., Cherniack, M., Stonebraker, M.: Load shedding in a data stream manager. In: Proc. VLDB, pp. 309–320 (2003) Vitter J.S. (1984). Faster methods for random sampling. Commun. ACM 27(7): 703–718 Vitter J.S. (1985). Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1): 37–57 Zechner, H.: Efficient sampling from continuous and discrete distributions. Ph.D. thesis, Technical University Graz (1997)