Digital Preservation in Grids and Clouds: A Middleware Approach
Abstract
Digital preservation is the persistent archiving of digital assets for future access and reuse, irrespective of the underlying platform and software solutions. Existing preservation systems focus strongly on Grids, but the advent of cloud technologies offers an attractive alternative. We describe a middleware system that enables a flexible choice between a Grid and a cloud, both for ad-hoc computations that arise during the execution of a preservation workflow and for archiving digital objects. The choice between infrastructures remains open throughout the lifecycle of the archive, allowing a smooth switch between solutions to accommodate the changing requirements of the organization whose digital assets are being preserved. We also offer insights into the costs, running times, and organizational issues of cloud computing, showing that the cloud alternative is particularly attractive for smaller organizations without access to a Grid or with limited IT infrastructure.