Springer Science and Business Media LLC
Notable scientific publications
* Data is for reference only
On compressing data in wireless sensor networks for energy efficiency and real time delivery
Springer Science and Business Media LLC - Volume 31 - Pages 151-182 - 2012
Wireless sensor networks possess significant limitations in storage, bandwidth, processing, and energy. Additionally, real-time sensor network applications such as monitoring poisonous gas leaks cannot tolerate high latency. While some good data compression algorithms exist specific to sensor networks, in this paper we present TinyPack, a suite of energy-efficient methods with high compression ratios that further reduce latency, storage, and bandwidth usage in comparison with other recently proposed algorithms. Our Huffman-style compression schemes exploit temporal locality and delta compression to provide the improved bandwidth utilization that is important in wireless sensor networks, thus reducing latency for real-time sensor-based monitoring applications. Our performance evaluations over many different real data sets, using a simulation platform as well as a hardware implementation, show comparable compression ratios and energy savings with a significant decrease in latency compared to other existing approaches. We also discuss robust error correction and recovery methods to address the packet loss and corruption common in sensor network environments.
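The sketch below illustrates the two ideas named in the abstract, delta compression followed by a Huffman code over the deltas. It is a minimal reconstruction of the general technique under stated assumptions, not the paper's TinyPack scheme; the sample readings are invented for illustration.

```python
# Delta compression + Huffman coding over the deltas: temporal locality
# makes consecutive sensor readings similar, so deltas cluster around 0
# and receive the shortest codewords. Illustrative only, not TinyPack.
import heapq
from collections import Counter

def deltas(readings):
    """First reading kept as-is; every later value becomes a difference."""
    return readings[:1] + [b - a for a, b in zip(readings, readings[1:])]

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    heap = [[w, i, {s: ""}] for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:                    # merge two lightest subtrees
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in c1.items()}
        merged.update({s: "1" + bits for s, bits in c2.items()})
        heapq.heappush(heap, [w1 + w2, tiebreak, merged])
        tiebreak += 1
    return heap[0][2]

readings = [20, 20, 21, 21, 22, 21, 21, 20, 21, 21]  # hypothetical samples
ds = deltas(readings)
code = huffman_code(ds)
payload = "".join(code[d] for d in ds)
print(ds, code, f"{len(payload)} bits vs {len(readings) * 16} raw")
```

Because the frequent delta 0 gets a one- or two-bit codeword, the encoded stream is much smaller than transmitting raw 16-bit samples, which is where the bandwidth and energy savings come from.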
On resolving schematic heterogeneity in multidatabase systems
Springer Science and Business Media LLC - Volume 1 - Pages 251-279 - 1993
The objective of a multidatabase system is to provide a single uniform interface for accessing multiple independent databases managed by multiple independent, and possibly heterogeneous, database systems. One crucial element in the design of a multidatabase system is the design of a data definition language for specifying a schema that represents the integration of the schemas of multiple independent databases. The design of such a language in turn requires a comprehensive classification of the conflicts (i.e., discrepancies) among the schemas of the independent databases and the development of techniques for resolving (i.e., homogenizing) all of the conflicts in the classification. An earlier paper provided a comprehensive classification of the schematic conflicts that may arise when integrating multiple independent relational database (RDB) schemas into a single multidatabase (MDB) schema. In this paper, we provide a comprehensive classification of techniques for resolving the schematic conflicts that may arise when integrating multiple RDB schemas, RDB schemas and object-oriented database (OODB) schemas, or multiple OODB schemas. The classification of conflict resolution techniques covers not only the techniques necessary for resolving the schematic conflicts identified in the earlier paper, but also those for the additional conflicts that arise when OODBs become part of the databases to be integrated. Most of the conflict resolution techniques discussed in the paper have already been incorporated into SQL/M, a multidatabase language implemented in UniSQL/M, a commercially available multidatabase system from UniSQL, Inc. that integrates SQL-based relational database systems and the UniSQL/X unified relational and object-oriented database system.
Set similarity join on massive probabilistic data using MapReduce
Springer Science and Business Media LLC - Volume 32 - Pages 447-464 - 2013
In this paper, we focus on set similarity join over massive probabilistic data using MapReduce, a problem for which no efficient approach previously existed. MapReduce is a popular paradigm for processing large volumes of data, and we propose two MapReduce-based approaches to this task: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of the existence probabilities to filter out, directly at the map task, probabilistic sets that have no chance of being similar to any other probabilistic set. Hadoop Join by Reduce Side Pruning uses a probability-sum pruning principle and a probability-upper-bound pruning principle to reduce the number of candidate pairs at the reduce task, saving comparison cost. Building on these approaches, we propose a hybrid solution that employs both map-side and reduce-side pruning. Finally, we implemented the above approaches on Hadoop-0.20.2 and performed comprehensive experiments on their performance, including the speedup relative to the naive method, Block Nested Loop Join. The experimental results show that our approaches perform much better than Block Nested Loop Join and also scale well. To the best of our knowledge, this is the first work to address set similarity join on massive probabilistic data using the MapReduce paradigm, and the approaches proposed in this paper provide a new way to process massive probabilistic data.
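As a concrete illustration of the map-side idea, the sketch below drops any probabilistic set whose total existence probability cannot reach a similarity threshold. The probability-sum bound used here is a simplified stand-in for the paper's exact pruning principle, and the records and threshold are invented.

```python
# Map-side pruning sketch: discard probabilistic sets with too little
# total probability mass to be similar to ANY partner set. The test
# (probability sum >= threshold) is an illustrative simplification of
# the paper's pruning principle, applied locally with no shuffle cost.

def map_side_prune(records, threshold):
    """records: iterable of (set_id, {element: existence_probability}).
    Yields only records surviving the probability-sum filter, mimicking
    what a Hadoop map task would emit to the reducers."""
    for set_id, elems in records:
        if sum(elems.values()) >= threshold:
            yield set_id, elems

records = [
    ("s1", {"a": 0.9, "b": 0.8, "c": 0.7}),
    ("s2", {"a": 0.1, "d": 0.05}),           # pruned: total mass 0.15
    ("s3", {"b": 0.6, "c": 0.5}),
]
survivors = list(map_side_prune(records, threshold=0.5))
print([sid for sid, _ in survivors])         # ['s1', 's3']
```

Pruning at the map side is attractive precisely because the filtered sets never enter the shuffle phase, which is typically the dominant cost in a MapReduce join.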
Scalable graph-based OLAP analytics over process execution data
Springer Science and Business Media LLC - 2016
General-purpose blade infrastructure for configurable system architectures
Springer Science and Business Media LLC - Volume 22 - Pages 197-198 - 2007
Performance models of data parallel DAG workflows for large scale data analytics
Springer Science and Business Media LLC - Volume 41 - Pages 299-329 - 2023
Directed Acyclic Graph (DAG) workflows are widely used for large-scale data analytics in cluster-based distributed computing systems. Building a performance model for a DAG on data-parallel frameworks (e.g., MapReduce) is a research challenge because the allocation of preemptable system resources among parallel jobs may vary dynamically during execution, which makes it difficult to estimate execution time accurately. In this paper, we tackle this challenge by proposing a new cost model, called Bottleneck Oriented Estimation (BOE), which estimates the allocation of preemptable resources by identifying the bottleneck, so as to accurately predict task execution time. For a DAG workflow, we propose a state-based approach that iteratively uses the resource allocation properties among stages to estimate the overall execution plan. Furthermore, to handle the skewness of various jobs, we refine the model with order statistics theory to improve estimation accuracy. Extensive experiments were performed to validate these cost models with HiBench and TPC-H workloads. The BOE model outperforms the state-of-the-art models by a factor of five for task execution time estimation. For the refined skew-aware model, the average prediction error is under $$3\%$$ when estimating the execution time of 51 hybrid analytics (HiBench) and query (TPC-H) DAG workflows.
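As a rough intuition for bottleneck-oriented estimation, the sketch below assumes a task's duration is governed by its most contended resource. The resource names, demands, and allocation rates are hypothetical, and the paper's actual BOE formulas are more involved.

```python
# Toy bottleneck-oriented task-time estimate: for each resource, the time
# needed is demand / allocated rate; the task finishes no sooner than its
# slowest (bottleneck) resource allows. Inputs are hypothetical.

def estimate_task_time(demands, allocations):
    """demands: resource -> amount of work (MB to read, CPU-seconds, ...).
    allocations: resource -> rate the scheduler currently grants the task.
    Returns the bottleneck resource and the implied task duration."""
    times = {r: demands[r] / allocations[r] for r in demands}
    bottleneck = max(times, key=times.get)
    return bottleneck, times[bottleneck]

# Hypothetical task: 512 MB of disk reads, 40 CPU-seconds of work, and
# 128 MB of shuffle, given these per-second allocation rates.
demands = {"disk_MB": 512, "cpu_s": 40, "net_MB": 128}
allocations = {"disk_MB": 100, "cpu_s": 2, "net_MB": 50}
print(estimate_task_time(demands, allocations))  # ('cpu_s', 20.0)
```

The point of identifying the bottleneck is that when preemptable allocations change mid-execution, only the bottleneck resource's new rate matters for re-estimating the remaining time.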
Stratified random sampling from streaming and stored data
Springer Science and Business Media LLC - Volume 39 - Pages 665-710 - 2020
Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must, in the worst case, have a variance that is a factor of $$\varOmega (r)$$ away from optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding-window-based streaming SRS needs a workspace of $$\varOmega (rM\log W)$$ in the worst case to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window. Due to the inherently high workspace needs of sliding-window-based SRS, we present SW-VOILA, a multi-layer practical sampling algorithm that uses only O(M) workspace but can, in practice, maintain an SRS of size close to M over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to that of their optimal offline counterparts, which are given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.
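Since VOILA is described as a strict generalization of Neyman allocation, the classical formula is a useful reference point: a stratum of size N_h with standard deviation s_h receives a share of the sample budget proportional to N_h·s_h. The sketch below implements only this textbook baseline with made-up stratum statistics, not VOILA itself.

```python
# Neyman allocation, the classical baseline that VOILA generalizes:
# allocate sample budget n across strata proportionally to N_h * s_h.
# Optimal only when every stratum is abundant, as the abstract notes.

def neyman_allocation(n, sizes, stddevs):
    """Return per-stratum sample sizes minimizing the variance of the
    stratified estimator under the abundant-stratum assumption."""
    weights = [N * s for N, s in zip(sizes, stddevs)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Three equal-sized strata: the most volatile one gets most of the budget.
print(neyman_allocation(100, sizes=[1000, 1000, 1000],
                        stddevs=[1.0, 2.0, 7.0]))
# [10.0, 20.0, 70.0]
```

The failure mode that motivates a generalization is visible here: if a stratum's prescribed allocation exceeds the number of elements it actually contains, the formula's optimality breaks down.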
Consensus Methods for Solving Inconsistency of Replicated Data in Distributed Systems
Springer Science and Business Media LLC - Volume 14 - Pages 53-69 - 2003
Replication of data is a popular and convenient form of data organization in distributed systems. Together with its advantages, data replication brings specific problems that system designers have to solve. This paper deals with methods for resolving inconsistency in replicated data. The problem investigated in this work is how to restore data consistency when, after some period of operation, the versions of the data differ from each other across sites of the system. We propose a solution to this problem based on determining a consensus of the replicated data versions. We assume that a distance function between versions of replicated data can be defined; different consensus choice functions are then defined and analyzed. A numerical and practical example of applying these methods is also presented.
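One natural consensus choice function suggested by the abstract picks the version minimizing the total distance to all replicas (a 1-median over the stored versions). The sketch below uses Hamming distance on fixed-length records as an illustrative distance function; the paper's distance functions and choice functions are more general.

```python
# Minimal consensus sketch: given a distance between data versions,
# choose the stored version with minimum total distance to the others.
# Hamming distance is an illustrative choice, not the paper's.

def hamming(a, b):
    """Number of positions where two equal-length versions disagree."""
    return sum(x != y for x, y in zip(a, b))

def consensus(versions, dist=hamming):
    """Return the version minimizing the sum of distances to all replicas."""
    return min(versions, key=lambda v: sum(dist(v, u) for u in versions))

# Three sites hold diverging versions of the same 5-field record.
versions = [
    (1, 2, 3, 4, 5),
    (1, 2, 3, 4, 9),
    (1, 2, 7, 4, 5),
]
print(consensus(versions))  # (1, 2, 3, 4, 5): total distance 2, the minimum
```

Restricting the consensus to one of the observed versions keeps the result trustworthy; a stronger variant could synthesize a new version field by field, at the cost of possibly producing a state no site ever held.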
Efficient parallel processing of range queries through replicated declustering
Springer Science and Business Media LLC - Volume 20 - Pages 117-147 - 2006
A common technique used to minimize I/O in data-intensive applications is data declustering over parallel servers. This technique distributes data among several disks so as to parallelize query retrieval and thus improve performance. We focus on optimizing access to large spatial data and the most common type of query on such data, i.e., range queries. An optimal declustering scheme is one in which the processing of every range query is balanced uniformly among the available disks. It has been shown that single-copy declustering schemes are non-optimal for range queries. In this paper, we integrate replication with parallel disk declustering for efficient processing of range queries. We note that replication is widely used in database applications for purposes such as load balancing, fault tolerance, and data availability. We develop theoretical foundations for replicated declustering and propose a class of replicated declustering schemes, periodic allocations, which are shown to be strictly optimal for a number of disks. We propose a framework for replicated declustering that uses a limited amount of replication, and we provide extensions to apply it to real data, including arbitrary grids and a large number of disks. Our framework also provides an effective indexing scheme that enables fast identification of the data of interest on parallel servers. In addition to optimal processing of single queries, we show that this framework is effective for parallel processing of multiple queries. We present experimental results comparing the proposed replication scheme to other techniques for both single and multiple queries, on synthetic and real data sets.
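To give a flavor of periodic allocation with replication, the sketch below places the primary and replica copies of each grid tile by two different linear rules modulo the number of disks, so a range query can fetch each tile from whichever copy balances load better. The slope and offset constants are illustrative assumptions, not the paper's provably optimal allocations.

```python
# Toy replicated periodic declustering over a 2-D tile grid: two linear
# placement rules give every tile two disk choices. Constants are
# illustrative; a careful scheme also guarantees the two copies of a
# tile always land on distinct disks.

NUM_DISKS = 5

def primary(i, j):
    return (i + j) % NUM_DISKS            # slope-1 diagonal striping

def replica(i, j):
    return (i + 2 * j + 1) % NUM_DISKS    # different slope plus an offset

def range_query_options(i0, i1, j0, j1):
    """For each tile of the query rectangle, list both disks holding it;
    a scheduler can then pick copies to balance retrieval across disks."""
    return {(i, j): (primary(i, j), replica(i, j))
            for i in range(i0, i1 + 1) for j in range(j0, j1 + 1)}

# A 3x2 range query: 6 tiles, each readable from two of the 5 disks.
for tile, disks in sorted(range_query_options(0, 2, 0, 1).items()):
    print(tile, "->", disks)
```

Having two candidate disks per tile is what lets replicated declustering approach the ideal where every range query's tiles are spread evenly, which single-copy schemes provably cannot achieve in general.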
Distributed computing connected components with linear communication cost
Springer Science and Business Media LLC - Volume 36 - Pages 555-592 - 2018
The paper studies three fundamental problems in graph analytics: computing the connected components (CCs), biconnected components (BCCs), and 2-edge-connected components (ECCs) of a graph. With the recent advent of big data, developing efficient distributed algorithms for computing the CCs, BCCs, and ECCs of a big graph has received increasing interest. As with existing research efforts, we focus on the Pregel programming model, though the techniques may be extended to other programming models including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur $$O(m\times \#\text{supersteps})$$ total costs for both data communication and computation, where m is the number of edges in the graph and #supersteps is the number of supersteps. Since network communication is usually much slower than computation, communication costs dominate the total running time of the existing techniques. In this paper, we propose a new paradigm based on graph decomposition that computes CCs and BCCs with O(m) total communication cost. The total computation costs of our techniques are also smaller than those of the existing techniques in practice, though theoretically almost the same. Moreover, we also study the distributed computation of ECCs; we are the first to study this problem, and we propose an approach with O(m) total communication cost. Comprehensive empirical studies demonstrate that our approaches outperform the existing techniques by one order of magnitude in total running time.
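For context, the sketch below shows the classic hash-min label propagation that underlies many Pregel CC algorithms: every vertex repeatedly adopts the smallest label in its neighborhood, sending messages each superstep. This is the O(m × #supersteps) baseline the abstract improves upon, not the paper's O(m)-communication decomposition technique.

```python
# Hash-min connected components, the Pregel-style baseline: each loop
# iteration corresponds to one superstep in which every vertex exchanges
# labels with its neighbors, so communication scales with m per superstep.

def hash_min_cc(adjacency):
    """adjacency: dict vertex -> list of neighbors (undirected graph).
    Returns (vertex -> component label, number of supersteps); the label
    converges to the smallest vertex id in each component."""
    label = {v: v for v in adjacency}
    supersteps = 0
    changed = True
    while changed:
        changed = False
        for v, nbrs in adjacency.items():
            best = min([label[v]] + [label[u] for u in nbrs])
            if best < label[v]:
                label[v] = best
                changed = True
        supersteps += 1
    return label, supersteps

graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
print(hash_min_cc(graph))  # components {0,1,2} and {3,4}, plus isolated 5
```

The number of supersteps here grows with the graph diameter, which is exactly why bounding total communication by O(m), independent of #supersteps, is the significant improvement claimed above.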
Total: 431