A linear-time probabilistic counting algorithm for database applications

ACM Transactions on Database Systems - Tập 15 Số 2 - Trang 208-229 - 1990
Kyu-Young Whang1, Brad T. Vander-Zanden2, Howard M. Taylor3
1Korea Advanced Institute of Science and Technology, Seoul, Korea
2[Cornell Univ., Ithaca, NY]
3Univ. of Delaware, Newark

Tóm tắt

We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O ( q ) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O ( q log q ) time complexity. Our technique, called linear counting , is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to he joined). These two parameters are important statistics that are used in relational query optimization and physical database design.

Từ khóa


Tài liệu tham khảo

10.1016/0306-4379(87)90014-7

10.1145/319983.319987

CONTE , S. D. , AND DE BOOR , C. Elementary Numerical Analysis: An Algorithmic Approach . 3 rd ed., McGraw-Hill , New York , 1980 . CONTE, S. D., AND DE BOOR, C. Elementary Numerical Analysis: An Algorithmic Approach. 3rd ed., McGraw-Hill, New York, 1980.

DEMOLOMBE , R. Estimation of the number of tuples satisfying a query expressed in predicate calculus language . In Proceedings of the 6th International Conference on Very Large Data Bases , 1980 , pp. 55 - 63 . DEMOLOMBE, R. Estimation of the number of tuples satisfying a query expressed in predicate calculus language. In Proceedings of the 6th International Conference on Very Large Data Bases, 1980, pp. 55-63.

FELLER , W. An Introduction to Probability Theory and Its Applications . Vol. 1 , 3 rd ed., Wiley , New York , 1968 . FELLER, W. An Introduction to Probability Theory and Its Applications. Vol. 1, 3rd ed., Wiley, New York, 1968.

10.5555/645910.673455

JOHNSON , N. L. , AND KOTZ , S. Urn Models and Their Application . Wiley , New York , 1977 . JOHNSON, N. L., AND KOTZ, S. Urn Models and Their Application. Wiley, New York, 1977.

MOOD , A. M. , GRAYBILL , F. A. , AND BOES , D.C. Introduction to the Theory of Statistics . 3 rd ed., McGraw-Hill , New York , 1974 . MOOD, A. M., GRAYBILL, F. A., AND BOES, D.C. Introduction to the Theory of Statistics. 3rd ed., McGraw-Hill, New York, 1974.

10.5555/645913.671474

10.1109/TSE.1985.231855

10.1145/582095.582099

10.1214/aoms/1177729794

VANDER-ZANDEN , B. W. , TAYLOR , H. M. , AND BITTON , D. Estimating block accesses when attributes are correlated . In Proceedings of the 12th International Conference on Very Large Data Bases (Kyoto , Aug. 1986 ), pp. 119 - 127 . VANDER-ZANDEN, B. W., TAYLOR, H. M., AND BITTON, D. Estimating block accesses when attributes are correlated. In Proceedings of the 12th International Conference on Very Large Data Bases (Kyoto, Aug. 1986), pp. 119-127.

10.1145/182.358452

10.1109/TC.1984.1676418

WIEDERHOLD , G. , AND EL-MARS 1, R. The structural model for database design , in Proceedings of the International Conference on Entity Relationship Approach ( Los Angeles , Dec. 1979 ), pp. 247 - 267 . WIEDERHOLD, G., AND EL-MARS1, R. The structural model for database design, in Proceedings of the International Conference on Entity Relationship Approach (Los Angeles, Dec. 1979), pp. 247-267.

W| EDERHOLD , G. Database Design . 2 nd ed., McGraw-Hill , New York , 1983 . W|EDERHOLD, G. Database Design. 2nd ed., McGraw-Hill, New York, 1983.

10.1109/VLDB.1979.718156

10.1145/582095.582109