Journal of Big Data

Notable scientific publications

* Data is for reference only

iiHadoop: an asynchronous distributed framework for incremental iterative computations
Journal of Big Data - Volume 4 - Pages 1-30 - 2017
Afaf G. Bin Saadon, Hoda M. O. Mokhtar
Data is never static; it keeps growing and changing over time. New data is added, and old data can be modified or deleted. This incremental nature of data motivates the development of new systems that perform large-scale data computations incrementally. MapReduce was introduced to provide an efficient approach for handling large-scale data computations. Nevertheless, it turned out to be inefficient in supporting the processing of small incremental data. While many previous systems have extended MapReduce to perform iterative or incremental computations, these systems are still inefficient and too expensive to perform large-scale iterative computations on changing data. In this paper, we present a new system called iiHadoop, an extension of the Hadoop framework, optimized for incremental iterative computations. iiHadoop accelerates program execution by performing the incremental computations on the small fraction of data that is affected by changes rather than on the whole dataset. In addition, iiHadoop improves performance by executing iterations asynchronously and by employing locality-aware scheduling for the map and reduce tasks, taking the incremental and iterative behavior into account. An evaluation of the proposed iiHadoop framework is presented using examples of iterative algorithms, and the results show significant performance improvements over comparable existing frameworks.
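iiHadoop itself extends Hadoop and its API is not shown in the abstract; the minimal Python sketch below only illustrates the general idea the abstract describes, namely re-aggregating only the records that changed and merging the delta into cached results instead of recomputing over the whole dataset. All names and the word-count example are hypothetical.

```python
# Toy illustration (not iiHadoop's actual API) of incremental computation:
# only records that changed since the last run are re-aggregated, and the
# resulting delta is merged into cached results.

from collections import Counter

def full_word_count(records):
    """Baseline: recompute counts over the whole dataset."""
    counts = Counter()
    for record in records:
        counts.update(record.split())
    return counts

def incremental_word_count(cached_counts, added_records, removed_records):
    """Incremental: apply only the delta implied by added/removed records."""
    counts = Counter(cached_counts)
    for record in added_records:
        counts.update(record.split())
    for record in removed_records:
        counts.subtract(record.split())
    # Drop words whose count fell to zero or below.
    return Counter({w: c for w, c in counts.items() if c > 0})

if __name__ == "__main__":
    base = ["big data systems", "data keeps growing"]
    cached = full_word_count(base)
    updated = incremental_word_count(
        cached,
        added_records=["incremental data processing"],
        removed_records=["big data systems"],
    )
    print(updated)
```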
On the sustainability of smart and smarter cities in the era of big data: an interdisciplinary and transdisciplinary literature review
Journal of Big Data - Volume 6 - Pages 1-64 - 2019
Simon Elias Bibri
There has recently been a conscious push for cities across the globe to be smart, and even smarter, and thus more sustainable by developing and implementing big data technologies and their applications across various urban domains, in the hope of reaching the required level of sustainability and improving citizens' living standards. Having gained momentum and traction as a promising response to the needed transition towards sustainability and to the challenges of urbanisation, smart and smarter cities as approaches to data-driven urbanism are increasingly adopting advanced forms of ICT to improve their performance in line with the goals of sustainable development and the requirements of urban growth. One such form that has tremendous potential to enhance urban operations, functions, services, designs, strategies, and policies in this direction is big data analytics and its application, owing to the well-informed decision-making and enhanced insights enabled by big data computing in the form of applied intelligence. However, topical studies on big data technologies and their applications in the context of smart and smarter cities tend to deal largely with economic growth and quality of life in terms of service efficiency and betterment, while overlooking and barely exploring the untapped potential of such applications for advancing sustainability. In fact, smart and smarter cities raise several issues and involve significant challenges when it comes to their development and implementation in the context of sustainability. In that regard, this paper provides a comprehensive, state-of-the-art review and synthesis of the field of smart and smarter cities in relation to sustainability and the related big data analytics and its application, in terms of the underlying foundations and assumptions, research issues and debates, opportunities and benefits, technological developments, emerging trends, future practices, and challenges and open issues. This study shows that smart and smarter cities are associated with misunderstandings and deficiencies as regards their incorporation of, and contribution to, sustainability. Nevertheless, as the study also reveals, tremendous opportunities are available for utilising big data analytics and its application in smart cities of the future to improve their contribution to the goals of sustainable development, by optimising and enhancing urban operations, functions, services, designs, strategies, and policies, as well as by finding answers to challenging analytical questions and thereby advancing knowledge forms. However, just as there are immense opportunities ahead to embrace and exploit, there are enormous challenges and open issues ahead to address and overcome in order to achieve a successful implementation of big data technology and its novel applications in such cities.
Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients
Journal of Big Data - Volume 4 - Pages 1-18 - 2017
Christopher Baechle, Ankur Agarwal, Xingquan Zhu
Chronic Obstructive Pulmonary Disease (COPD) is a chronic lung disease that affects airflow to the lungs. Discovering the co-occurrence of COPD with other diseases, symptoms, and medications is invaluable to medical staff. Building co-occurrence indexes and finding causal relationships with COPD can be difficult because disease prevalence within a population often influences results, so a method that better separates occurrence within COPD patients from population prevalence is desirable. Large hospital systems may have tens of millions of patient records spanning decades of collection, which calls for a scalable big data approach. The presented method, Co-Occurring Evidence Discovery (COED), provides a methodology and framework to address these issues. Natural Language Processing methods are used to examine 64,371 deidentified clinical notes and discover associations between COPD and medical terms. Apache cTAKES is leveraged to annotate and structure the clinical notes, and several extensions to cTAKES have been written to parallelize the annotation of large sets of notes. A co-occurrence score is presented that penalizes scores based on term prevalence, alongside a baseline method traditionally used for finding co-occurrence. These scoring systems are implemented using Apache Spark. Dictionaries of ground truth terms for diseases, medications, and symptoms have been created using clinical domain knowledge. COED and the baseline methods are compared using precision, recall, and F1 score. The highest scoring diseases using COED are lung and respiratory diseases; in contrast, the baseline co-occurrence methods rank diseases with high population prevalence highest. Medications and symptoms evaluated with COED show similar results. When evaluated against the ground truth dictionaries, the maximum improvements in recall for symptoms, diseases, and medications were 0.212, 0.130, and 0.174, and the maximum improvements in precision were 0.303, 0.333, and 0.180. Median increases in F1 score for symptoms, diseases, and medications were 38.1%, 23.0%, and 17.1%. A paired t-test showed that the F1 score increases were statistically significant (p < 0.01). Penalizing terms that are highly frequent in the corpus results in better precision and recall, and gives a clearer picture of the diseases, symptoms, and medications that co-occur with COPD. Using a mathematical and computational approach rather than a purely expert-driven one, large dictionaries of COPD-related terms can be assembled in a short amount of time.
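The abstract does not give the exact COED formula, so the sketch below is only an illustrative stand-in: a co-occurrence score divided by a term's overall corpus prevalence (a lift/PMI-style normalization), contrasted with a raw co-occurrence baseline. The toy notes and function names are assumptions, and the real pipeline runs at scale on Spark over cTAKES annotations.

```python
# Simplified, prevalence-penalized co-occurrence scoring in the spirit of COED.
# "raw" rewards terms that are simply common; dividing by corpus prevalence
# penalizes them, surfacing terms specific to the target cohort.

def cooccurrence_scores(notes, target="copd"):
    """notes: list of sets of terms extracted from clinical notes."""
    n = len(notes)
    target_notes = [terms for terms in notes if target in terms]
    vocabulary = {t for terms in notes for t in terms if t != target}
    scores = {}
    for term in vocabulary:
        co = sum(1 for terms in target_notes if term in terms)
        if co == 0:
            continue
        prevalence = sum(1 for terms in notes if term in terms)
        raw = co / len(target_notes)                      # baseline score
        penalized = raw / (prevalence / n)                # prevalence-penalized score
        scores[term] = (raw, penalized)
    return scores

if __name__ == "__main__":
    notes = [{"copd", "dyspnea", "hypertension"},
             {"copd", "dyspnea", "albuterol"},
             {"hypertension", "diabetes"},
             {"hypertension", "aspirin"}]
    for term, (raw, pen) in sorted(cooccurrence_scores(notes).items()):
        print(f"{term}: raw={raw:.2f} penalized={pen:.2f}")
```

In this toy corpus, "hypertension" scores as highly as "dyspnea" under the raw baseline because it is prevalent everywhere, but the penalized score ranks the COPD-specific terms higher.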
Selecting critical features for data classification based on machine learning methods
Journal of Big Data - Volume 7 - Pages 1-26 - 2020
Rung-Ching Chen, Christine Dewi, Su-Wen Huang, Rezzy Eko Caraka
Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle the feature selection issue even with a higher number of variables. In this paper, we use three popular datasets with a higher number of variables (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiment. There are four main reasons why feature selection is essential: to simplify the model by reducing the number of parameters, to decrease the training time, to reduce overfitting by enhancing generalization, and to avoid the curse of dimensionality. Besides, we evaluate and compare the accuracy and performance of classification models such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is the best classifier. Practically, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results of each dataset with and without essential feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to get the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
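The study works in R with varImp(), Boruta, and RFE; as a rough Python analogue, the sketch below uses scikit-learn to rank features by Random Forest importance, keep the top ones, and compare accuracy with and without selection. The dataset and the cut-off of ten features are illustrative choices, not the paper's.

```python
# Random Forest importance-based feature selection, scikit-learn analogue.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit on all features and record baseline accuracy.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("all features:", accuracy_score(y_te, rf.predict(X_te)))

# Keep the 10 most important features according to the fitted forest.
top = np.argsort(rf.feature_importances_)[::-1][:10]
rf_sel = RandomForestClassifier(n_estimators=200, random_state=0)
rf_sel.fit(X_tr[:, top], y_tr)
print("top-10 features:", accuracy_score(y_te, rf_sel.predict(X_te[:, top])))
```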
Big data actionable intelligence architecture
Journal of Big Data - Volume 7 - Pages 1-19 - 2020
Tian J. Ma, Rudy J. Garcia, Forest Danford, Laura Patrizi, Jennifer Galasso, Jason Loyd
The amount of data produced by sensors, social and digital media, and the Internet of Things (IoT) is rapidly increasing each day. Decision makers often need to sift through a sea of Big Data and use information from a variety of sources to determine a course of action. This can be a very difficult and time-consuming task: for each data source encountered, the information can be redundant, conflicting, and/or incomplete, and for near-real-time applications there is insufficient time for a human to interpret all the information from different sources. In this project, we have developed a near-real-time, data-agnostic software architecture that is capable of using several disparate sources to autonomously generate Actionable Intelligence with a human in the loop. We demonstrate our solution on a traffic prediction exemplar problem.
Regularized Simple Graph Convolution (SGC) for improved interpretability of large datasets
Journal of Big Data - Volume 7 - Pages 1-17 - 2020
Phuong Pho, Alexander V. Mantzaris
Classification of data points that correspond to complex entities such as people or journal articles is an ongoing research task. Notable applications are recommendation systems that build on customers' features or past purchases, and, in academia, labeling relevant research papers in order to reduce the reading time required. The many features that can be extracted result in large datasets that are a challenge to process with complex machine learning methodologies. There is also the question of how the results are presented and how to interpret the parameterizations beyond the classification accuracies. This work shows how the network information contained in an adjacency matrix allows improved classification of entities through their associations, and how the framework of the Simple Graph Convolution (SGC) provides an expressive and fast approach. The proposed regularized SGC incorporates shrinkage upon three different aspects of the projection vectors, reducing the number of parameters, the size of the parameters, and the directions between the vectors, to produce more meaningful interpretations.
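A minimal sketch of the underlying SGC idea, assuming its usual formulation: propagate features over the symmetrically normalized adjacency for k steps, then fit a linear classifier on the propagated features. The shrinkage shown here is a plain L1 penalty on the projection weights, not the paper's specific three-part regularizer, and the toy graph is made up.

```python
# Simple Graph Convolution features with an L1-shrunk linear classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sgc_features(adjacency, features, k=2):
    """Propagate features k steps over the symmetrically normalized graph."""
    a_hat = adjacency + np.eye(adjacency.shape[0])       # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    s = d_inv_sqrt @ a_hat @ d_inv_sqrt                  # normalized adjacency
    x = features
    for _ in range(k):
        x = s @ x
    return x

# Tiny toy graph: 4 nodes, 2 features, 2 classes.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
labels = np.array([0, 0, 1, 1])

x_prop = sgc_features(adj, feats, k=2)
# The L1 penalty shrinks projection weights toward zero, aiding interpretability.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(x_prop, labels)
print("projection weights:", clf.coef_)
```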
Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems
Journal of Big Data - 2019
Eric Costa, Carlos Costa, Maribel Yasmina Santos
Improving lookup and query execution performance in distributed Big Data systems using Cuckoo Filter
Journal of Big Data - Volume 9 - Pages 1-30 - 2022
Sharafat Ibn Mollah Mosharraf, Muhammad Abdullah Adnan
Performance is a critical concern when reading and writing data from billions of records stored in a Big Data warehouse. We introduce two scopes for query performance improvement. One is to improve the performance of lookup queries after data deletion in Big Data systems that use eventual consistency; we propose a scheme that improves lookup performance after data deletion by using a Cuckoo Filter. The other is to avoid unnecessary network round-trips when querying remote nodes in a distributed Big Data cluster that are known not to hold the requested partition of data; we propose a scheme in which probabilistic filters are consulted before querying remote nodes, so that queries that would return no data can be skipped instead of passing through the network. We evaluate our schemes with Cassandra using a real dataset and show that each scheme can improve the performance of lookup queries by up to 2x.
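A small sketch of the second scheme's idea: keep one local probabilistic filter per remote node and consult it before issuing a network query, skipping nodes whose filter reports the key as definitely absent. The paper uses a Cuckoo filter, which additionally supports deletion; a simple Bloom filter is used here purely as a stand-in, and the node names and keys are hypothetical.

```python
# Skip remote round-trips for keys that a node's probabilistic filter rules out.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

# One filter per remote node, populated with the keys that node stores.
node_filters = {"node-a": BloomFilter(), "node-b": BloomFilter()}
node_filters["node-a"].add("user:42")

def lookup(key):
    for node, bloom in node_filters.items():
        if bloom.might_contain(key):
            print(f"querying {node} for {key}")   # network round-trip needed
        else:
            print(f"skipping {node} for {key}")   # round-trip saved

lookup("user:42")
lookup("user:99")
```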
Assessing survival time of heart failure patients: using Bayesian approach
Journal of Big Data - Volume 8 - Pages 1-18 - 2021
Tafese Ashine, Geremew Muleta, Kenenisa Tadesse
Heart failure is the inability of the heart to pump blood with normal efficiency and a growing public health issue with a high death rate worldwide, including in Ethiopia. The goal of this study was to identify factors affecting the survival time of heart failure patients. To achieve this aim, 409 heart failure patients were included in the study based on data taken from the medical records of patients enrolled from January 2016 to January 2019 at Jimma University Medical Center, Jimma, Ethiopia. Kaplan-Meier plots and the log-rank test were used to compare survival functions; the Cox proportional hazards model and Bayesian parametric survival models were used to analyze the survival time of heart failure patients using R software, with integrated nested Laplace approximation (INLA) methods applied for the Bayesian models. Of the heart failure patients in the study, 40.1% died and 59.9% were censored, and the estimated median survival time was 31 months. Using model selection criteria, the Bayesian log-normal accelerated failure time model was found to be the most appropriate, and it described the heart failure patients' survival data well. The results of this model show that age, chronic kidney disease, diabetes mellitus, etiology of heart failure, hypertension, anemia, cigarette smoking, and stage of heart failure all have a significant impact on the survival time of heart failure patients. The findings suggest that age group (49 to 65 years, and 65 years or older); etiology of heart failure (rheumatic valvular heart disease, hypertensive heart disease, and other diseases); the presence of hypertension, anemia, or chronic kidney disease; smoking; diabetes mellitus (type I and type II); and stage of heart failure (II, III, and IV) shortened the survival time of heart failure patients.
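The study fits a Bayesian log-normal accelerated failure time (AFT) model with INLA in R, and its data are not public. As a rough, non-Bayesian Python analogue, the sketch below uses lifelines' log-normal AFT fitter on a bundled example dataset simply to show the model form, log(T) = x'beta + sigma*eps with standard normal eps, where positive coefficients lengthen and negative coefficients shorten survival time.

```python
# Log-normal AFT fit (maximum likelihood, not Bayesian/INLA) as an illustration
# of the accelerated failure time formulation used in the study.

from lifelines import LogNormalAFTFitter
from lifelines.datasets import load_rossi  # stand-in dataset; the heart-failure data is not public

df = load_rossi()                           # duration column 'week', event indicator 'arrest'
aft = LogNormalAFTFitter()
aft.fit(df, duration_col="week", event_col="arrest")
aft.print_summary()                         # coefficients act multiplicatively on survival time
```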
An improved deep hashing model for image retrieval with binary code similarities
Journal of Big Data - Volume 11, Issue 1
Huawen Liu, Zongda Wu, Minghao Yin, Dunshan Yu, Xiaoyan Zhu, Jungang Lou
The exponential growth of data raises an unprecedented challenge in data analysis: how to retrieve interesting information from such large-scale data. Hash learning is a promising solution to this challenge because it brings many potential advantages, such as extremely high efficiency and low storage cost, after projecting high-dimensional data to compact binary codes. However, traditional hash learning algorithms often suffer from semantic inconsistency, where images with similar semantic features may have different binary codes. In this paper, we propose a novel end-to-end deep hashing method based on the similarities of binary codes, dubbed CSDH (Code Similarity-based Deep Hashing), for image retrieval. Specifically, it extracts deep features from images to capture semantic information using a pre-trained deep convolutional neural network, and a hidden, fully connected layer is attached at the end of the deep network to derive hash bits by virtue of an activation function. To preserve the semantic consistency of images, a loss function is introduced that takes the label similarities as well as the Hamming embedding distances into consideration. By doing so, CSDH can learn more compact and powerful hash codes, which not only preserve semantic similarity but also have small Hamming distances between similar images. To verify the effectiveness of CSDH, we evaluate it on two public benchmark image collections, CIFAR-10 and NUS-WIDE, against five classic shallow hashing models and six popular deep hashing ones. The experimental results show that CSDH achieves performance competitive with the popular deep hashing algorithms.
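The abstract does not spell out the CSDH loss, so the PyTorch sketch below only illustrates the general recipe it describes: a pre-trained CNN backbone, a fully connected hash layer with a smooth activation, and a pairwise loss driven by label similarity that pulls similar images toward identical codes and pushes dissimilar ones apart. The contrastive-style loss, the backbone choice, and all hyperparameters are assumptions, not the authors' exact method.

```python
# Illustrative deep hashing skeleton with a pairwise, label-driven code loss.

import torch
import torch.nn as nn
import torchvision.models as models

class DeepHashNet(nn.Module):
    def __init__(self, n_bits=48):
        super().__init__()
        backbone = models.resnet18(weights=None)      # load pre-trained weights in practice
        backbone.fc = nn.Identity()                   # expose the 512-d feature vector
        self.backbone = backbone
        self.hash_layer = nn.Linear(512, n_bits)

    def forward(self, x):
        # tanh gives relaxed codes in (-1, 1); sign() binarizes them at retrieval time.
        return torch.tanh(self.hash_layer(self.backbone(x)))

def pairwise_hash_loss(codes, labels, margin=2.0):
    """Contrastive-style loss on relaxed codes using label (dis)similarity."""
    dist = torch.cdist(codes, codes)                  # pairwise distances between codes
    sim = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    loss_sim = sim * dist.pow(2)                      # pull same-label pairs together
    loss_dis = (1 - sim) * torch.clamp(margin - dist, min=0).pow(2)
    return (loss_sim + loss_dis).mean()

if __name__ == "__main__":
    net = DeepHashNet(n_bits=48)
    images = torch.randn(8, 3, 224, 224)              # dummy batch
    labels = torch.randint(0, 10, (8,))
    codes = net(images)
    print(pairwise_hash_loss(codes, labels))
```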

Total: 577