Efficient spatial data partitioning for distributed $$k$$ NN joins

Journal of Big Data - Tập 9 - Trang 1-42 - 2022
Ayman Zeidan1, Huy T. Vo2
1Department of Computer Science, CUNY Graduate Center, New York, USA
2Department of Computer Science, CUNY City College, New York, USA

Tóm tắt

Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scalability, accuracy, or prolonged runtimes. Spatial extensions remedy the problem and introduce spatial data recognition and operations. At the core of a spatial extension, a locality-preserving spatial partitioner determines how to spatially group the dataset’s objects into smaller chunks using the distributed system’s available resources. Existing spatial extensions rely on data sampling and often mismanage non-spatial data by either overlooking their memory requirements or excluding them entirely. This work discusses the various challenges that face spatial data partitioning and proposes a novel spatial partitioner for effectively processing spatial queries over large spatial datasets. For evaluation, the proposed partitioner is integrated with the well-known k-Nearest Neighbor ( $$k$$ NN) spatial join query. Several experiments evaluate the proposal using real-world datasets. Our approach differs from existing proposals by (1) accounting for the dataset’s unique spatial traits without sampling, (2) considering the computational overhead required to handle non-spatial data, (3) minimizing partition shuffles, (4) computing the optimal utilization of the available resources, and (5) achieving accurate results. This contributes to the problem of spatial data partitioning through (1) providing a comprehensive discussion of the problems facing spatial data partitioning and processing, (2) the development of a novel spatial partitioning technique for in-memory distributed processing, (3) an effective, built-in, load-balancing methodology that reduces spatial query skews, and (4) a Spark-based implementation of the proposed work with an accurate $$k$$ NN spatial join query. Experimental tests show up to $$1.48$$ times improvement in runtime as well as the accuracy of results.

Tài liệu tham khảo

Bernard Marr Fc. How much data do we create every day the mindblowing stats everyone should read. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/?sh=356c32f960ba. Rohit Kulkarni Fc. Big data goes big. https://www.forbes.com/sites/rkulkarni/2019/02/07/big-data-goes-big/?sh=7284031320d7 Tankovska HS. Number of social media users 2025 Statista. https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/ Kim G-H, Trimi S, Chung J-H. Big-data applications in the government sector. Commun ACM. 2014;57(3):78–85. https://doi.org/10.1145/2500873. Zheng Y, Liu Y, Yuan J, Xie X. Urban computing with taxicabs. In: Proceedings of the 13th International Conference on Ubiquitous Computing, 2011; pp. 89–98. https://doi.org/10.1145/2030112.2030126. Zhang D, Zhao J, Zhang F, He T. comobile: Real-time human mobility modeling at urban scale using multi-view learning. In: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2015; pp. 1–10. https://doi.org/10.1145/2820783.2820821. Yuan J, Zheng Y, Zhang C, Xie W, Xie X, Sun G, Huang Y. T-drive: driving directions based on taxi trajectories. In: Proceedings of the 18th SIGSPATIAL international conference on advances in geographic information systems, 2010; 99–108. https://doi.org/10.1145/1869790.1869807. Huang Y, Powell JW. Detecting regions of disequilibrium in taxi services under uncertainty. In: Proceedings of the 20th International conference on advances in geographic information systems, 2012; pp. 139–148. https://doi.org/10.1145/2424321.2424340. Markets and Markets. Geospatial solutions market worth \$502.6 Billion by 2024 - Exclusive report by markets and marketsTM. 2019. https://www.prnewswire.com/news-releases/geospatial-solutions-market-worth-502-6-billion-by-2024--exclusive- report-by-marketsandmarkets-300895569.html . Shiftehfar R. Uber’s big data platform: 100+ petabytes with minute latency. 2018. https://eng.uber.com/uber-big-data-platform/ . Li B, Zhang D, Sun L, Chen C, Li S, Qi G, Yang Q. Hunting or waiting? discovering passenger-finding strategies from a large-scale real-world taxi dataset. In: 2011 IEEE International conference on pervasive computing and communications workshops (PERCOM Workshops), IEEE. 2011; pp. 63–68. https://doi.org/10.1109/PERCOMW.2011.5766967. Hadoop A. Apache Hadoop. https://hadoop.apache.org/. Foundation TAS. Apache spark unified analytics engine for big data. https://spark.apache.org/. Yu J, Wu J, Sarwat M. Geospark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems, 2015; pp. 1–4. https://doi.org/10.1145/2820783.2820860. Huang Z, Chen Y, Wan L, Peng X. Geospark sql: an effective framework enabling spatial queries on spark. ISPRS Int J Geo Inform. 2017;6(9):285. https://doi.org/10.3390/ijgi6090285. Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG. Locationspark: a distributed in-memory data management system for big spatial data. Proc VLDB Endow. 2016;9(13):1565–8. https://doi.org/10.14778/3007263.3007310. Hagedorn S, Gotze P, Sattler K-U. The stark framework for spatio-temporal data analytics on spark. Datenbanksysteme für Business, Technologie und Web (BTW 2017). 2017. Jacox EH, Samet H. Spatial join techniques. ACM Trans Database Syst. 2007;32(1):7. https://doi.org/10.1145/1206049.1206056. Zeidan A, Lagerspetz E, Zhao K, Nurmi P, Tarkoma S, Vo HT. Geomatch: efficient large-scale map matching on apache spark. ACM Trans Data Sci. 2020;1(3):1–30. https://doi.org/10.1145/3402904. Shekhar S, Lu C, Tan X, Chawla S, Vatsavai R. A visualization tool for spatial data warehouses. Geogr Data Min Knowl Dis. 2001;73:16–72. Eldawy A, Mokbel MF, Jonathan C. Hadoopviz: A mapreduce framework for extensible visualization of big spatial data. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE. 2016; pp. 601–612 . https://doi.org/10.1109/ICDE.2016.7498274. Roussopoulos N, Kelley S, Vincent F. Nearest neighbor queries. In: Proceedings of the 1995 ACM SIGMOD International conference on management of data, 1995; pp. 71–79. https://doi.org/10.1145/223784.223794. Hadoop A. Cluster mode overview - spark 2.4.0 Documentation.html. http://spark.apache.org/docs/2.4.0/cluster-overview.html. 2020. Jelvix. Top 10 big data frameworks in 2021 | Jelvix. https://jelvix.com/blog/top-5-big-data-frameworks. Forbes. Spark or hadoop – which is the best big data framework? https://www.forbes.com/sites/bernardmarr/2015/06/22/spark-or-hadoop-which-is-the-best-big-data-framework/?sh=55ab4da7127e Microsoft. What is apache spark? | Microsoft Docs. https://docs.microsoft.com/en-us/dotnet/spark/what-is-spark. Scheuermann P, Weikum G, Zabback P. Data partitioning and load balancing in parallel disk systems. VLDB J. 1998;7(1):48–66. https://doi.org/10.1007/s007780050053. Lee K, Liu L. Scaling queries over big rdf graphs with semantic hash partitioning. Proc VLDB Endow. 2013;6(14):1894–905. https://doi.org/10.14778/2556549.2556571. Abadi DJ, Marcus A, Madden SR, Hollenbach K. Scalable semantic web data management using vertical partitioning. In: Proceedings of the 33rd international conference on very large data bases, 2007; pp. 411–422. Vo H, Aji A, Wang F. SATO: a spatial data partitioning framework for scalable query processing. In: Proceedings of the 22nd ACM SIGSPATIAL international conference on advances in geographic information systems. SIGSPATIAL ’14, pp. 545–548. New York; ACM. https://doi.org/10.1145/2666310.2666365, 2014. Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz J. Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proc VLDB Endow. 2013;6(11):1009–20. Eldawy A. Spatialhadoop: towards flexible and scalable spatial processing using mapreduce. In: Proceedings of the 2014 SIGMOD PhD symposium, ACM 2014; pp. 46–50. https://doi.org/10.1145/2602622.2602625. Magellan. GitHub—harsha2010/magellan: geo spatial data analytics on spark. https://github.com/harsha2010/magellan He Y, Tan H, Luo W, Feng S, Fan J. Mr-dbscan: a scalable mapreduce-based dbscan algorithm for heavily skewed data. Front Comput Sci. 2014;8(1):83–99. https://doi.org/10.1007/s11704-013-3158-3. Xie D, Li F, Yao B, Li G, Zhou L, Guo M. Simba: Efficient in-memory spatial analytics. In: Proceedings of the 2016 international conference on management of data, 2016; pp. 1071–1085. https://doi.org/10.1145/2882903.2915237. Leutenegger ST, Lopez MA, Edgington J. Str: a simple and efficient algorithm for r-tree packing. In: Proceedings 13th international conference on data engineering. IEEE. 1997; pp. 497–506. https://doi.org/10.1109/ICDE.1997.582015. Al Aghbari Z, Ismail T, Kamel I. Sparknn: a distributed in-memory data partitioning for knn queries on big spatial data. Data Sci J. 2020;19(1):00. https://doi.org/10.5334/dsj-2020-035. Chatzigeorgakidis G, Karagiorgou S, Athanasiou S, Skiadopoulos S. Fml-knn: scalable machine learning on big data using k-nearest neighbor joins. J Big Data. 2018;5(1):1–27. https://doi.org/10.1186/s40537-018-0115-x. Ben Brahim M, Drira W, Filali F, Hamdi N. Spatial data extension for cassandra nosql database. J Big Data. 2016;3(1):1–16. https://doi.org/10.1186/s40537-016-0045-4. Costa E, Costa C, Santos MY. Evaluating partitioning and bucketing strategies for hive-based big data warehousing systems. J Big Data. 2019;6(1):1–38. https://doi.org/10.1186/s40537-019-0196-1. Minasny B, McBratney AB, Walvoort DJ. The variance quadtree algorithm: use for spatial sampling design. Comput Geosci. 2007;33(3):383–92. https://doi.org/10.1016/j.cageo.2006.08.009. Li Z, Lee KC, Zheng B, Lee W-C, Lee D, Wang X. Ir-tree: an efficient index for geographic document search. IEEE Trans knowl Data Eng. 2010;23(4):585–99. https://doi.org/10.1109/TKDE.2010.149. Aragon CR, Seidel R. Randomized search trees In: FOCS. 1989;30:540–5. https://doi.org/10.1007/BF01940876. NJordan72. harsha2010: GitHub—harsha2010/magellan: geo spatial data analytics on spark. https://github.com/harsha2010/magellan. Li J, Xu L, Tang L, Wang S, Li L. Big data in tourism research: a literature review. Tour Manag. 2018;68:301–23. https://doi.org/10.1016/j.tourman.2018.03.009. GeoJSON. GeoJSON. https://geojson.org/. (undefined 11/3/2021 23:28). Zeidan A, Lagerspetz E, Zhao K, Nurmi P, Tarkoma S, Vo HT. Geomatch: Efficient large-scale map matching on apache spark. In: 2018 IEEE International Conference on Big Data (Big Data). IEEE. 2018; pp. 384–391. https://doi.org/10.1109/BigData.2018.8622488. Chang H-w, Tai Y-c, Hsu JY-j. Context-aware taxi demand hotspots prediction. Int J Bus Intell Data Min. 2010;5(1):3–18. https://doi.org/10.1504/IJBIDM.2010.030296. Zhang H, Chen G, Ooi BC, Tan K-L, Zhang M. In-memory big data management and processing: a survey. IEEE Tran Knowl Data Eng. 2015;27(7):1920–48. https://doi.org/10.1109/TKDE.2015.2427795. Cahsai A, Ntarmos N, Anagnostopoulos C, Triantafillou P. Scaling k-nearest neighbours queries (the right way). In: 2017 IEEE 37th international conference on distributed computing systems (ICDCS). IEEE. 2017; pp. 1419–1430 . https://doi.org/10.1109/ICDCS.2017.267. George L. HBase: the definitive guide: random access to your planet-size data. “ O’Reilly Media, Inc.”. Spark A. [SPARK-6235] Address various 2G limits–ASF JIRA. https://issues.apache.org/jira/browse/SPARK-6235 Guttman A. R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD International conference on management of data, 1984; pp. 47–57. https://doi.org/10.1145/602259.602266. Beckmann N, Kriegel H, Schneider R, Seeger B. The r*-tree: an efficient and robust access method for points and rectangles. In: Proceedings of the 1990 ACM SIGMOD international conference on management of data, 1990; pp. 322–331. https://doi.org/10.1145/93597.98741. Bentley JL. Multidimensional binary search trees used for associative searching. Commun ACM. 1975;18(9):509–17. https://doi.org/10.1145/361002.361007. Finkel RA, Bentley JL. Quad trees a data structure for retrieval on composite keys. Acta inform. 1974;4(1):1–9. https://doi.org/10.1007/BF00288933. Samet H. An overview of quadtrees, octrees, and related hierarchical data structures. Theoretical foundations of computer graphics and CAD, 1988; 51–68. https://doi.org/10.1007/978-3-642-83539-1_2. Rigaux P, Scholl M, Voisard A. Spatial databases: with application to GIS. Elsevier. Spark A. Tuning—spark 2.4.0 documentation. https://spark.apache.org/docs/2.4.0/tuning.html#memory-management-overview. Foundation TAS. configuration—spark 2.4.5 documentation. https://spark.apache.org/docs/latest/configuration.html Foundation TAS. Spark 2.4.5 JavaDoc. https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/util/SizeEstimator.html. Pandey V, Kipf A, Neumann T, Kemper A. How good are modern spatial analytics systems? Proc VLDB Endow. 2018;11(11):1661–73. https://doi.org/10.14778/3236187.3236213. OpenStreetMap: researcher information—OpenStreetMap Wiki 2022. https://wiki.openstreetmap.org/wiki/Researcher_Information. MTA. MTA Bus Time® Historical data 2022. http://web.mta.info/developers/MTA-Bus-Time-historical-data.html. MTA. TLC trip record data—TLC 2022. https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. Kumar M. World geodetic system 1984: a modern and accurate global reference frame. Mar Geod. 1988;12(2):117–26. https://doi.org/10.1080/15210608809379580. U.S. Department of Commerce, N.O., Administration, A.: North American Datum of 1983 (NAD 83) - Horizontal and Geometric Datums - Datums - National Geodetic Survey (2022). https://geodesy.noaa.gov/datums/horizontal/north-american-datum-1983.shtml. Wiki O. Points of interest—OpenStreetMap Wiki. https://wiki.openstreetmap.org/wiki/Points_of_interest. OpenStreetMap contributors: OpenStreetMap. 2017. https://www.openstreetmap.org. LocationTech: LocationTech JTS Topology Suite | projects.eclipse.org. https://projects.eclipse.org/projects/locationtech.jts.