Array databases: concepts, standards, implementations
Abstract
Multi-dimensional arrays (also known as raster data or gridded data) play a key role in many, if not all, science and engineering domains, where they typically represent spatio-temporal sensor, image, simulation output, or statistics “datacubes”. As classic database technology does not support arrays adequately, such data today are maintained mostly in silo solutions, with architectures that tend to erode and fail to keep up with the increasing requirements on performance and service quality. Array Database systems attempt to close this gap by providing declarative query support for flexible ad-hoc analytics on large n-D arrays, similar to what SQL offers on set-oriented data, XQuery on hierarchical data, and SPARQL and Cypher on graph data. Today, Petascale Array Database installations exist, employing massive parallelism and distributed processing. Hence, questions arise about the technology and standards available, usability, and overall maturity. Several papers have compared models and formalisms, and benchmarks have been undertaken as well, typically comparing two systems against each other. While each of these represents valuable research, to the best of our knowledge there is no comprehensive survey combining model, query language, architecture, practical usability, and performance aspects. The size of this comparison also differentiates our study: 19 systems are compared and four are benchmarked, to an extent and depth clearly exceeding previous papers in the field; for example, subsetting tests were designed so that systems cannot be tuned specifically to these queries. It is hoped that this gives a representative overview to all who want to immerse themselves in the field, as well as clear guidance to those who need to choose the best-suited datacube tool for their application. This article presents results of the Research Data Alliance (RDA) Array Database Assessment Working Group (ADA:WG), a subgroup of the Big Data Interest Group.
The working group has elicited the state of the art in Array Databases, technically supported by IEEE GRSS and CODATA Germany, to answer the question: how can data scientists and engineers benefit from Array Database technology? As it turns out, Array Databases can offer significant advantages in terms of flexibility, functionality, extensibility, as well as performance and scalability. In total, the database approach of offering analysis-ready “datacubes” heralds a new level of service quality. Investigation shows that there is a lively ecosystem of technology with increasing uptake, and proven array analytics standards are in place. Consequently, such approaches have to be considered a serious option for datacube services in science, engineering and beyond. Tools, though, vary greatly in functionality and performance.