Chimera: a virtual data system for representing, querying, and automating data derivation

I. Foster1,2, J. Vockler1, M. Wilde2, Yong Zhao1
1Department of Computer Science, University of Chicago, Chicago, IL, USA
2Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA

Abstract

Much scientific data is not obtained from measurements but is instead derived from other data by applying computational procedures. We hypothesize that representing these procedures explicitly can enable documentation of data provenance, discovery of available methods, and on-demand data generation (so-called "virtual data"). To explore this idea, we have developed the Chimera virtual data system. It combines a virtual data catalog, which represents data derivation procedures and derived data, with a virtual data language interpreter, which translates user requests into data definition and query operations on the database. We couple the Chimera system with distributed "data grid" services to enable on-demand execution of computation schedules constructed from database queries. We have applied this system, with promising results, to two challenge problems: the reconstruction of simulated collision event data from a high-energy physics experiment, and the search of digital sky survey data for galactic clusters.
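The idea of recording derivation procedures so that data can be traced and regenerated on demand can be illustrated with a minimal Python sketch. It is not the Chimera implementation or its virtual data language: the class names `Derivation` and `VirtualDataCatalog`, the methods `register`, `derive`, `provenance`, and `materialize`, and the sample data names are all hypothetical, and an in-memory dictionary stands in for the catalog database and the grid execution services.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Derivation:
    """Hypothetical record: `output` is produced by applying a named procedure to `inputs`."""
    output: str
    transformation: str
    inputs: Tuple[str, ...]


@dataclass
class VirtualDataCatalog:
    """Toy in-memory stand-in for a virtual data catalog (an assumption, not Chimera's schema)."""
    transformations: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    derivations: Dict[str, Derivation] = field(default_factory=dict)
    materialized: Dict[str, Any] = field(default_factory=dict)

    def register(self, name: str, func: Callable[..., Any]) -> None:
        """Record a computational procedure under a name."""
        self.transformations[name] = func

    def derive(self, output: str, transformation: str, *inputs: str) -> None:
        """Record how `output` can be derived, without computing it yet."""
        self.derivations[output] = Derivation(output, transformation, inputs)

    def provenance(self, name: str) -> List[str]:
        """List every data item that `name` depends on, directly or indirectly."""
        lineage: List[str] = []

        def visit(item: str) -> None:
            recipe = self.derivations.get(item)
            if recipe is None:  # measured (raw) data has no recorded derivation
                return
            for parent in recipe.inputs:
                if parent not in lineage:
                    lineage.append(parent)
                    visit(parent)

        visit(name)
        return lineage

    def materialize(self, name: str) -> Any:
        """Return the data item, computing it and its ancestors only if needed."""
        if name in self.materialized:
            return self.materialized[name]
        recipe = self.derivations[name]  # KeyError if the item is neither stored nor derivable
        args = [self.materialize(parent) for parent in recipe.inputs]
        value = self.transformations[recipe.transformation](*args)
        self.materialized[name] = value  # cache so later requests reuse the result
        return value


# Usage: one measured dataset and two derived datasets defined only by their recipes.
catalog = VirtualDataCatalog()
catalog.materialized["raw_events"] = [3, 1, 2]
catalog.register("sort", sorted)
catalog.register("first", lambda xs: xs[0])
catalog.derive("sorted_events", "sort", "raw_events")
catalog.derive("smallest_event", "first", "sorted_events")

print(catalog.provenance("smallest_event"))
print(catalog.materialize("smallest_event"))
```

In this sketch, a request for `smallest_event` triggers computation of `sorted_events` first, and both results are cached, so a later request returns the stored value instead of recomputing it.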

Keywords

#Data systems #Computer applications #Documentation #Distributed computing #Grid computing #Processor scheduling #Distributed databases #Computational modeling #Discrete event simulation #Physics
