Stratified random sampling from streaming and stored data

Springer Science and Business Media LLC - Volume 39 - Pages 665-710 - 2020
Trong Duc Nguyen1, Ming-Hung Shih1, Divesh Srivastava2, Srikanta Tirthapura1, Bojian Xu3
1Iowa State University, Ames, USA
2AT&T - Research, Austin, USA
3Eastern Washington University, Cheney, USA

Abstract

Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is a factor of $\Omega(r)$ away from the optimal, where $r$ is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding window-based streaming SRS algorithm needs a workspace of $\Omega(rM\log W)$ in the worst case to maintain a variance-optimal SRS of size $M$, where $W$ is the number of elements in the sliding window. Because of this inherently high workspace requirement, we present SW-VOILA, a multi-layer practical sampling algorithm that uses only $O(M)$ workspace but can maintain an SRS of size close to $M$ in practice over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to that of their optimal offline counterparts, which are given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.
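To make the remark about Neyman allocation concrete, the following is a minimal sketch of the textbook rule, which gives each stratum a share of the sample budget proportional to its size times its standard deviation. It is not the VOILA algorithm. The function names `neyman_allocation` and `overflowing_strata`, and the example numbers, are illustrative assumptions, and "abundant" is read here as "the stratum holds at least as many elements as it is allocated".

```python
from typing import List, Sequence


def neyman_allocation(sizes: Sequence[int],
                      stds: Sequence[float],
                      budget: float) -> List[float]:
    """Textbook Neyman allocation: stratum i gets a share of the
    budget proportional to sizes[i] * stds[i].

    Returns fractional allocations; rounding is left to the caller.
    """
    if len(sizes) != len(stds):
        raise ValueError("sizes and stds must have the same length")
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one stratum needs positive size and std")
    return [budget * w / total for w in weights]


def overflowing_strata(sizes: Sequence[int],
                       allocation: Sequence[float]) -> List[int]:
    """Indices of strata that are asked for more samples than they
    contain, i.e. strata that are not abundant for this allocation."""
    return [i for i, (n, a) in enumerate(zip(sizes, allocation)) if a > n]


# Hypothetical data: a large, low-variance stratum and a tiny,
# high-variance one, with a total sample budget of 100.
sizes = [1000, 10]
stds = [1.0, 500.0]
allocation = neyman_allocation(sizes, stds, budget=100)
not_abundant = overflowing_strata(sizes, allocation)
```

In this example the second stratum is assigned about 83 samples but contains only 10 elements, so the allocation cannot be realized as stated. This is the situation in which the abundance assumption fails and a method that accounts for bounded strata is needed.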
