Journal of Big Data

Notable scientific publications

* Data are provided for reference purposes only

Efficient spatial data partitioning for distributed kNN joins
Journal of Big Data - Volume 9 - Pages 1-42 - 2022
Ayman Zeidan, Huy T. Vo
Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data’s spatial attributes and results in poor scalability, accuracy, or prolonged runtimes. Spatial extensions remedy the problem and introduce spatial data recognition and operations. At the core of a spatial extension, a locality-preserving spatial partitioner determines how to spatially group the dataset’s objects into smaller chunks using the distributed system’s available resources. Existing spatial extensions rely on data sampling and often mismanage non-spatial data by either overlooking their memory requirements or excluding them entirely. This work discusses the various challenges that face spatial data partitioning and proposes a novel spatial partitioner for effectively processing spatial queries over large spatial datasets. For evaluation, the proposed partitioner is integrated with the well-known k-Nearest Neighbor (kNN) spatial join query. Several experiments evaluate the proposal using real-world datasets. Our approach differs from existing proposals by (1) accounting for the dataset’s unique spatial traits without sampling, (2) considering the computational overhead required to handle non-spatial data, (3) minimizing partition shuffles, (4) computing the optimal utilization of the available resources, and (5) achieving accurate results. This work contributes to the problem of spatial data partitioning through (1) a comprehensive discussion of the problems facing spatial data partitioning and processing, (2) the development of a novel spatial partitioning technique for in-memory distributed processing, (3) an effective, built-in load-balancing methodology that reduces spatial query skew, and (4) a Spark-based implementation of the proposed work with an accurate kNN spatial join query. Experimental tests show up to 1.48 times improvement in runtime as well as accurate results.
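To make the locality-preserving partitioning idea concrete, here is a minimal, self-contained sketch (not the paper's partitioner, which avoids sampling and accounts for non-spatial payloads): points are bucketed into a uniform grid so that nearby objects share a partition, and a kNN query scans the query cell plus a widening ring of neighboring cells. All names and data are illustrative.

```python
import math
from collections import defaultdict

def grid_partition(points, cell_size):
    """Group 2-D points by uniform grid cell; each dict entry is one partition."""
    parts = defaultdict(list)
    for x, y in points:
        parts[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return parts

def knn(parts, cell_size, q, k):
    """Approximate kNN: scan the query cell and widen the ring until k points
    are found (a production partitioner must also prove no closer point lies
    outside the scanned ring)."""
    cx, cy = int(q[0] // cell_size), int(q[1] // cell_size)
    ring, found = 1, []
    while len(found) < k and ring <= 50:
        found = [p for dx in range(-ring, ring + 1)
                   for dy in range(-ring, ring + 1)
                   for p in parts.get((cx + dx, cy + dy), ())]
        ring += 1
    return sorted(found, key=lambda p: math.dist(p, q))[:k]

pts = [(0.5, 0.5), (1.2, 0.9), (5.0, 5.0), (0.9, 1.1)]
print(knn(grid_partition(pts, 1.0), 1.0, (1.0, 1.0), 2))
```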
Towards a folksonomy graph-based context-aware recommender system of annotated books
Journal of Big Data - Volume 8 - Pages 1-17 - 2021
Sara Qassimi, El Hassan Abdelwahed, Meriem Hafidi, Aimad Qazdar
The emergence of collaborative interactions has empowered users by enabling their interactions through tagging practices that create a folksonomy, i.e., a classification of the shared resources (any identifiable thing or item on the system). In education, tagging is considered a powerful meta-cognitive strategy that successfully engages learners in the learning process. Moreover, collaborative tagging gathers learners’ opinions and thus provides more comprehensible recommendations. Still, the abundant shared content is mostly unorganized, which makes it hard for users to select and discover the appropriate items of interest. The use of recommender systems overcomes this distressing search problem by assisting users in their search and exploration experience and suggesting relevant items that match their preferences. In this regard, this article presents a folksonomy graph-based context-aware recommender system (CARS) of annotated books. The generated graphs express the semantic relatedness between these resources, i.e. books, by effectively modeling the folksonomy relationship between user, resource, and tag and integrating contextual information within a multi-layer graph referring to a Knowledge Graph (KG). To put our proposal into shape, we model a real-world application on the Goodbooks-10k dataset to recommend books. The proposed approach incorporates spectral clustering to deal with the graph partitioning problem. The experimental evaluation shows relevant performance results of graph-based book recommendations.
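As a concrete illustration of the clustering step only, the toy sketch below builds a book-book affinity matrix from shared folksonomy tags and partitions it with scikit-learn's spectral clustering; the tag data and affinity choice are invented and far simpler than the paper's multi-layer knowledge graph.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical folksonomy: book -> tags assigned by users.
tags = {
    "book_a": {"ml", "python"},
    "book_b": {"ml", "stats"},
    "book_c": {"poetry", "classics"},
    "book_d": {"classics", "novel"},
}
books = list(tags)
n = len(books)

# Book-book affinity = number of shared tags (a crude edge weight).
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        A[i, j] = A[j, i] = len(tags[books[i]] & tags[books[j]])
A += 0.01  # weak background connectivity so the graph stays connected
np.fill_diagonal(A, 0)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(dict(zip(books, labels)))  # two tag-coherent groups of books
```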
Array databases: concepts, standards, implementations
Journal of Big Data - Volume 8 - Pages 1-61 - 2021
Peter Baumann, Dimitar Misev, Vlad Merticariu, Bang Pham Huu
Multi-dimensional arrays (also known as raster data or gridded data) play a key role in many, if not all, science and engineering domains, where they typically represent spatio-temporal sensor, image, simulation-output, or statistics “datacubes”. As classic database technology does not support arrays adequately, such data today are maintained mostly in silo solutions, with architectures that tend to erode and not keep up with the increasing requirements on performance and service quality. Array Database systems attempt to close this gap by providing declarative query support for flexible ad-hoc analytics on large n-D arrays, similar to what SQL offers on set-oriented data, XQuery on hierarchical data, and SPARQL and Cypher on graph data. Today, Petascale Array Database installations exist, employing massive parallelism and distributed processing. Hence, questions arise about the technology and standards available, usability, and overall maturity. Several papers have compared models and formalisms, and benchmarks have been undertaken as well, typically comparing two systems against each other. While each of these represents valuable research, to the best of our knowledge there is no comprehensive survey combining model, query language, architecture, practical usability, and performance aspects. The scale of this comparison also differentiates our study: 19 systems are compared, four of them benchmarked, to an extent and depth clearly exceeding previous papers in the field; for example, subsetting tests were designed so that systems cannot be tuned specifically to these queries. It is hoped that this gives a representative overview to all who want to immerse themselves in the field, as well as clear guidance to those who need to choose the best-suited datacube tool for their application. This article presents results of the Research Data Alliance (RDA) Array Database Assessment Working Group (ADA:WG), a subgroup of the Big Data Interest Group. It has elicited the state of the art in Array Databases, technically supported by IEEE GRSS and CODATA Germany, to answer the question: how can data scientists and engineers benefit from Array Database technology? As it turns out, Array Databases can offer significant advantages in terms of flexibility, functionality, and extensibility, as well as performance and scalability; in total, the database approach of offering analysis-ready “datacubes” heralds a new level of service quality. Investigation shows that there is a lively ecosystem of technology with increasing uptake, and proven array analytics standards are in place. Consequently, such approaches have to be considered a serious option for datacube services in science, engineering, and beyond. Tools, though, vary greatly in functionality and performance, as it turns out.
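For readers new to the datacube idea, the sketch below expresses typical array-database operations (trimming, slicing, aggregation) in plain NumPy for illustration; real Array Database systems evaluate the equivalent requests as declarative queries over persistent, tiled arrays rather than in-memory ndarrays.

```python
import numpy as np

# A toy 3-D "datacube": time x lat x lon.
cube = np.random.default_rng(0).random((365, 180, 360))

# Subsetting ("trimming"): one month over a lat/lon window.
window = cube[0:31, 40:60, 100:140]

# Slicing: drop the time axis at a single timestep.
day = cube[17, :, :]

# Aggregation: per-pixel annual mean, then a scalar global mean.
annual_mean = cube.mean(axis=0)
print(window.shape, day.shape, float(annual_mean.mean()))
```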
Mining frequent itemsets from streaming transaction data using genetic algorithms
Journal of Big Data - Volume 7 - Pages 1-20 - 2020
Sikha Bagui, Patrick Stanley
This paper presents a study of mining frequent itemsets from streaming data in the presence of concept drift. Streaming data, being volatile in nature, is particularly challenging to mine. An approach using genetic algorithms is presented, and various relationships between concept drift, sliding window size, and genetic algorithm constraints are explored. Concept drift is identified by changes in frequent itemsets. The novelty of this work lies in determining concept drift using frequent itemsets for mining streaming data, within the genetic algorithm framework. Formulas are presented for calculating minimum support counts in streaming data using sliding windows. Testing highlighted that the ratio of the window size to transactions per drift was key to good performance. Getting good results when the sliding window size was too small was a challenge, since normal fluctuations in the data could appear to be concept drift. Window size must be managed in conjunction with support and confidence values in order to achieve reasonable results. This method of detecting concept drift performed well when larger window sizes were used.
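A minimal sketch of the windowed support-counting idea follows (the genetic-algorithm search itself is omitted): itemset counts are maintained over the last w transactions, and a change in the frequent set between consecutive windows is flagged as possible drift. The thresholds and data are illustrative.

```python
from collections import Counter, deque
from itertools import combinations

def frequent_itemsets(window, min_support):
    """Itemsets (sizes 1-2) whose count meets min_support * window size."""
    counts = Counter()
    for txn in window:
        for r in (1, 2):
            for itemset in combinations(sorted(txn), r):
                counts[itemset] += 1
    threshold = min_support * len(window)
    return {s for s, c in counts.items() if c >= threshold}

stream = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"},   # regime 1
          {"x", "y"}, {"x", "y"}, {"x", "z"}, {"y", "z"}]   # regime 2 (drift)
window, prev = deque(maxlen=4), None
for i, txn in enumerate(stream):
    window.append(txn)
    if len(window) == window.maxlen:
        current = frequent_itemsets(window, min_support=0.5)
        if prev is not None and current != prev:
            print(f"possible concept drift after transaction {i}")
        prev = current
```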
Discovering top-weighted k-truss communities in large graphs
Journal of Big Data - Volume 9 - Pages 1-25 - 2022
Wafaa M. A. Habib, Hoda M. O. Mokhtar, Mohamed E. El-Sharkawi
Community search is the problem of querying networks in order to discover dense subgraphs (communities) that satisfy given query parameters. Most community search models consider link structure and ignore link weight while answering the required queries. Given the importance of link weight in different networks, this paper considers both link structure and link weight to discover top-r weighted k-truss communities via community search. The top-weighted k-truss communities are those with the highest weight and the highest cohesiveness within the network. All recent studies that considered link weight discover top-weighted communities via global-search and index-based search techniques. In this paper, three different algorithms are proposed to scale up the existing approaches to weighted community search via local search. The performance evaluation shows that the proposed algorithms significantly outperform the existing state-of-the-art algorithms over different datasets, improving search time by several orders of magnitude.
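For intuition, the small example below extracts a k-truss with networkx and ranks its connected components by total edge weight, mimicking top-weighted community selection; it is a brute-force illustration, not the paper's local-search algorithms.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 5), ("b", "c", 4), ("a", "c", 6),    # heavy triangle
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),    # light triangle
    ("c", "d", 2),                                  # bridge (in no triangle)
])

# In the 3-truss, every edge must participate in at least one triangle,
# so the bridge c-d is pruned away.
truss = nx.k_truss(G, 3)
communities = sorted(
    (sum(d["weight"] for _, _, d in truss.subgraph(c).edges(data=True)),
     sorted(c))
    for c in nx.connected_components(truss)
)[::-1]
print(communities)  # heaviest k-truss community first
```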
Developing a mathematical model of the co-author recommender system using graph mining techniques and big data applications
Journal of Big Data - Volume 8, Issue 1 - 2021
Fezzeh Ebrahimi, Asefeh Asemi, Amin Nezarat, Andrea Kő
Finding the most suitable co-author is one of the most important ways to conduct effective research in scientific fields. Data science has contributed significantly to achieving this possibility. The present study aims at designing a mathematical model of a co-author recommender system in bioinformatics using graph mining techniques and big data applications. The present study employed an applied-developmental research method and a mixed-methods research design. The research population consisted of all scientific products in bioinformatics in the PubMed database. To achieve the research objectives, the most appropriate features affecting the choice of a co-author were selected, prioritized, and weighted by experts. Then, they were weighted using graph mining techniques and big data applications. Finally, the mathematical co-author recommender system model in bioinformatics was presented. Data analysis instruments included Expert Choice, Excel, Spark, and the Scala and Python programming languages on a big data server. The research was conducted in four steps: (1) identifying and prioritizing the criteria effective in choosing a co-author using AHP; (2) determining the correlation degree of articles based on the criteria obtained from step 1 using algorithms and big data applications; (3) developing a mathematical co-author recommender system model; and (4) evaluating the developed mathematical model. Findings showed that the journal-title and citation criteria have the highest weight, while the abstract has the lowest weight, in the mathematical co-author recommender system model. The accuracy of the proposed model was 72.26%. It was concluded that using content-based features and expert opinions has high potential for recommending the most appropriate co-authors. It is expected that the proposed co-author recommender system model can provide appropriate recommendations for choosing co-authors in various fields and in different contexts of scientific information. The most important innovation of this model is its combination of expert opinions and systemic weights, which can accelerate the finding of co-authors and consequently save time and improve the quality of scientific products.
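The scoring idea can be sketched as a weighted sum of per-criterion similarities; the weights and similarity values below are invented for illustration (loosely echoing the reported ordering, with journal titles and citations weighted highest and the abstract lowest), not the paper's actual AHP output.

```python
# Illustrative AHP-style weights (sum to 1); hypothetical values only.
weights = {"journal": 0.35, "citations": 0.30, "keywords": 0.20,
           "coauthors": 0.10, "abstract": 0.05}

def coauthor_score(similarities):
    """Weighted sum of per-criterion similarities, each in [0, 1]."""
    return sum(weights[c] * similarities.get(c, 0.0) for c in weights)

# Hypothetical candidate co-authors with per-criterion similarities.
candidates = {
    "author_x": {"journal": 0.9, "citations": 0.7, "keywords": 0.8,
                 "coauthors": 0.2, "abstract": 0.4},
    "author_y": {"journal": 0.3, "citations": 0.9, "keywords": 0.5,
                 "coauthors": 0.6, "abstract": 0.9},
}
ranked = sorted(candidates, key=lambda a: coauthor_score(candidates[a]),
                reverse=True)
print([(a, round(coauthor_score(candidates[a]), 3)) for a in ranked])
```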

Experimenting sensitivity-based anonymization framework in Apache Spark
Journal of Big Data - Volume 5 - Pages 1-26 - 2018
Mohammed Al-Zobbi, Seyed Shahrestani, Chun Ruan
One of the biggest concerns of big data and analytics is privacy. We believe forthcoming frameworks and theories will establish several solutions for privacy protection. One of the known solutions is k-anonymity, which was introduced for traditional data. Recently, two major frameworks have leveraged big data processing and applications: MapReduce and Spark. Spark data processing has been attracting more attention due to its crucial impact on a wide range of big data applications. One of the predominant big data applications is data analytics and anonymization. We previously proposed an anonymization method for implementing k-anonymity in the MapReduce processing framework. In this paper, we investigate Spark’s performance in processing data anonymization. Spark is a fast processing framework that has been employed in several application areas, such as SQL, multimedia, and data streaming. Our focus is Spark SQL, which is adequate for big data anonymization. Since Spark operates in-memory, we need to observe its limitations, speed, and fault tolerance as data size increases, and to compare MapReduce with Spark in processing anonymization. Spark introduces an abstraction called resilient distributed datasets, which reads and serializes a collection of objects partitioned across a set of machines. Developers claim that Spark can outperform MapReduce by 10 times in iterative machine learning jobs. Our experiments in this paper compare MapReduce and Spark. The overall results show better performance for Spark’s processing time in anonymization operations. However, in some limited cases, when cluster resources are limited and the network is not congested, we still prefer the older MapReduce framework.
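A minimal Spark SQL sketch of the k-anonymity step is shown below (illustrative only, not the authors' framework): a quasi-identifier is generalized into bands and equivalence classes smaller than k are suppressed. The column names and data are made up.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("k-anon-sketch").getOrCreate()
k = 2
df = spark.createDataFrame(
    [(34, "11375", "flu"), (36, "11375", "cold"),
     (36, "11379", "flu"), (51, "11375", "flu")],
    ["age", "zip", "diagnosis"])

# Generalize quasi-identifiers: 10-year age bands, 3-digit ZIP prefix.
gen = (df.withColumn("age_band", F.floor(F.col("age") / 10) * 10)
         .withColumn("zip3", F.col("zip").substr(1, 3)))

# Suppress equivalence classes with fewer than k rows.
sizes = gen.groupBy("age_band", "zip3").count()
anon = (gen.join(sizes, ["age_band", "zip3"])
           .where(F.col("count") >= k)
           .drop("age", "zip", "count"))
anon.show()
```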
Automatic building segmentation from LIDAR based on DGCNN and Euclidean clustering
Journal of Big Data - Volume 7 - Pages 1-18 - 2020
Ahmad Gamal, Ari Wibisono, Satrio Bagus Wicaksono, Muhammad Alvin Abyan, Nur Hamid, Hanif Arif Wisesa, Wisnu Jatmiko, Ronny Ardhianto
The demand for 3D modeling from Earth observations is increasing, particularly for urban and regional planning and management purposes. The results of 3D observations have gradually become a primary data source for policy determination and infrastructure planning. In this study, we present a method for automatic building segmentation that uses LIDAR data directly. Previous works have used CNN-based methods to segment buildings automatically. However, existing works have relied heavily on converting LIDAR data into Digital Terrain Model (DTM), Digital Surface Model (DSM), or Digital Elevation Model (DEM) formats. These formats require converting LIDAR data into raster images, which poses a challenge in assessing building volumes. In this paper, we collected LIDAR data with an unmanned aerial vehicle and segmented buildings directly from that LIDAR data. We used the Dynamic Graph Convolutional Neural Network (DGCNN) algorithm to separate buildings and vegetation. We then applied Euclidean clustering to segment each individual building. We found that the combination of these methods outperforms previous works in this field, with an accuracy of up to 95.57% and an Intersection over Union (IoU) score of 0.85.
#automatic segmentation #LIDAR #DGCNN #Euclidean clustering #urban planning
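To illustrate the instance-separation step only (DGCNN training is out of scope here), the sketch below groups building-class points by Euclidean proximity using scikit-learn's DBSCAN with min_samples=1, which behaves like classic Euclidean cluster extraction; the point coordinates are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical building-class LIDAR points (x, y, z) from two structures.
points = np.array([
    [0.0, 0.0, 3.0], [0.5, 0.2, 3.1], [0.9, 0.4, 2.9],    # building 1
    [20.0, 5.0, 6.0], [20.4, 5.3, 6.2], [19.8, 4.9, 5.9], # building 2
])

# With min_samples=1, DBSCAN clusters are exactly the connected components
# under the eps distance threshold, i.e. Euclidean cluster extraction.
labels = DBSCAN(eps=2.0, min_samples=1).fit_predict(points)
for b in np.unique(labels):
    print(f"building {b}: {np.sum(labels == b)} points")
```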
Arabia Felix 2.0: a cross-linguistic Twitter analysis of happiness patterns in the United Arab Emirates
Journal of Big Data - Volume 6 - Pages 1-20 - 2019
Aamna Al Shehhi, Justin Thomas, Roy Welsch, Ian Grey, Zeyar Aung
The global popularity of social media platforms has given rise to unprecedented amounts of data, much of which reflects the thoughts, opinions and affective states of individual users. Systematic explorations of these large datasets can yield valuable information about a variety of psychological and sociocultural variables. The global nature of these platforms makes it important to extend this type of exploration across cultures and languages, as each situation is likely to present unique methodological challenges and yield findings particular to the specific sociocultural context. To date, very few studies exploring large social media datasets have focused on the Arab world. This study examined social media use in Arabic and English across the United Arab Emirates (UAE), looking specifically at indicators of subjective wellbeing (happiness) in both languages. A large social media dataset, spanning 2013 to 2017, was extracted from Twitter. More than 17 million Twitter messages (tweets), written in Arabic and English and posted by users based in the UAE, were analyzed. Numerous differences were observed between individuals posting messages (tweeting) in English and those posting in Arabic. These differences included significant variations in the mean number of tweets posted and the mean size of users’ networks (e.g. the number of followers). Additionally, using lexicon-based sentiment analytic tools (Hedonometer and Valence Shift Word Graphs), temporal patterns of happiness (expressions of positive sentiment) were explored in both languages across all seven regions (Emirates) of the UAE. Findings indicate that 7:00 am was the happiest hour, and Friday was the happiest day for both languages (the least happy day varied by language). The happiest months differed based on language, and there were also significant variations in sentiment patterns, peaks and troughs in happiness, associated with events of sociopolitical and religio-cultural significance for the UAE.
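A toy version of the lexicon-based scoring used by tools like the Hedonometer is sketched below: a tweet's happiness is the mean rating of its scored words. The word ratings here are invented; the real instrument uses a large human-rated lexicon.

```python
# Hypothetical 1-9 happiness ratings (invented for illustration).
happiness = {"love": 8.4, "happy": 8.2, "friday": 6.6,
             "traffic": 3.3, "sad": 2.4}

def score(tweet):
    """Mean lexicon rating over the tweet's rated words, or None if none."""
    words = [w.strip(".,!?") for w in tweet.lower().split()]
    rated = [happiness[w] for w in words if w in happiness]
    return sum(rated) / len(rated) if rated else None

tweets = ["Happy Friday everyone, love this weather",
          "Stuck in traffic again, so sad"]
for t in tweets:
    print(round(score(t), 2), "-", t)
```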
An integrated model for evaluation of big data challenges and analytical methods in recommender systems
Journal of Big Data - Volume 9 - Pages 1-26 - 2022
Adeleh Asemi, Asefeh Asemi, Andrea Ko, Ali Alibeigi
The study aimed to present an integrated model for the evaluation of big data (BD) challenges and analytical methods in recommender systems (RSs). The proposed model used fuzzy multi-criteria decision making (MCDM), a human-judgment-based method, for weighting RSs’ properties. Human judgment is associated with uncertainty and gray information. We used fuzzy techniques to integrate, summarize, and calculate quality-value judgment distances. Then, two fuzzy inference systems (FISs) were implemented for scoring BD challenges and data analytical methods in different RSs. In experimental testing of the proposed model, a correlation coefficient (CC) analysis was conducted to test the relationship between a BD challenge evaluation for a collaborative filtering-based RS and the results of the fuzzy inference systems. The result shows the ability of the proposed model to evaluate BD properties in RSs. Future studies may improve the FIS by providing rules for evaluating BD tools.
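As a rough illustration of how a fuzzy inference system turns a crisp input into a score, the sketch below hand-rolls triangular memberships and a weighted average of rule outputs (a zero-order Sugeno shortcut); the variable, rules, and numbers are invented and far simpler than the paper's systems.

```python
def tri(x, a, b, c):
    """Triangular membership: rises over a->b, falls over b->c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fis_score(volume):
    """Score one hypothetical BD challenge ('data volume', 0-10 scale)."""
    rules = [
        (tri(volume, -1, 0, 5), 2.0),    # IF volume is low    -> mild
        (tri(volume, 2, 5, 8), 5.0),     # IF volume is medium -> moderate
        (tri(volume, 5, 10, 11), 9.0),   # IF volume is high   -> severe
    ]
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0

for v in (1, 5, 9):
    print(v, round(fis_score(v), 2))
```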