International Journal of Data Science and Analytics
Notable publications
* Data provided for reference only
Training data scientists: a few challenges
International Journal of Data Science and Analytics - Volume 6 - Pages 201-204 - 2018
Faced with a large shortage of data science talent, initial education is not enough to fill the gap: lifelong learning is a necessity. An example of such a training program is given.
Simultaneously feature selection and parameters optimization by teaching–learning and genetic algorithms for diagnosis of breast cancer
International Journal of Data Science and Analytics - 2024
Currently, the development of early and accurate breast cancer (BC) prediction models using computer-aided tools has proven beneficial, which in turn lowers the mortality rate associated with this disease. However, feature selection (FS) is a challenging task for the identification and characterization of cancers that increase susceptibility to common, complex, multifactorial BC, especially in the context of clinical treatment. Most previous FS techniques do not handle important characterization steps such as removing irrelevant and redundant features separately. Several evolutionary algorithms have been proposed to address FS problems, but they fail to classify BC survival types. To address these issues, numerous hybridized models have been designed to select the best features in an effort to increase the accuracy of breast cancer predictive models, yet obtaining the right parameters for optimal performance can be cumbersome. To resolve the deficiencies of past diagnostic systems, this paper proposes a consistent wrapper strategy, called TLBOG, that hybridizes teaching–learning-based optimization (TLBO) with a genetic algorithm (GA) to improve the reliability of evolutionary algorithms. The GA is used to tackle the slow convergence rate and to improve the exploitation capability of TLBO. Most importantly, the goal of the approach is to simultaneously select the best feature subset and optimize the parameters of a support vector machine so that it achieves higher accuracy than other machine learning models. The performance evaluation results show that the proposed approach significantly outperforms conventional wrapper techniques in terms of accuracy, sensitivity, precision, and F-measure on the WBCD and WDBC databases.
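The abstract describes a wrapper strategy in which each candidate encodes a binary feature mask together with SVM hyperparameters and is scored by cross-validated accuracy. The TLBOG algorithm itself is not reproduced here; the following is a minimal, hypothetical sketch of such a wrapper fitness function using scikit-learn on the WDBC data, with a simple random-search driver standing in for the TLBO/GA optimizer.

```python
# Hypothetical sketch of a wrapper-style fitness function: a candidate encodes
# a binary feature mask plus SVM hyperparameters (C, gamma) and is scored by
# cross-validated accuracy. A random-search driver stands in for TLBO/GA.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # WDBC data shipped with scikit-learn
rng = np.random.default_rng(0)

def fitness(mask, log_c, log_gamma):
    """Cross-validated accuracy of an SVM on the selected feature subset."""
    if not mask.any():
        return 0.0
    model = make_pipeline(StandardScaler(), SVC(C=10.0**log_c, gamma=10.0**log_gamma))
    return cross_val_score(model, X[:, mask], y, cv=5).mean()

best_score, best_candidate = -1.0, None
for _ in range(50):                           # random search as a placeholder optimizer
    mask = rng.random(X.shape[1]) < 0.5       # binary feature mask
    log_c, log_gamma = rng.uniform(-2, 3), rng.uniform(-4, 0)
    score = fitness(mask, log_c, log_gamma)
    if score > best_score:
        best_score, best_candidate = score, (mask, log_c, log_gamma)

print(f"best CV accuracy: {best_score:.3f} with {best_candidate[0].sum()} features")
```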
Anonymity and security improvements in heterogeneous connected vehicle networks
International Journal of Data Science and Analytics
CARE: coherent actionable recourse based on sound counterfactual explanations
International Journal of Data Science and Analytics - Pages 1-26 - 2022
Counterfactual explanation (CE) is a popular post hoc interpretability approach that explains how to obtain an alternative outcome from a machine learning model by specifying minimum changes in the input. In line with this context, when the model’s inputs represent actual individuals, actionable recourse (AR) refers to a personalized CE that prescribes feasible changes according to an individual’s preferences. Hence, the quality of ARs highly depends on the soundness of underlying CEs and the proper incorporation of user preferences. To generate sound CEs, several data-level properties, such as proximity and connectedness, should be taken into account. Meanwhile, personalizing explanations demands fulfilling important user-level requirements, like coherency and actionability. The main obstacles to inclusive consideration of the stated properties are their associated modeling and computational complexity as well as the lack of a systematic approach for making a rigorous trade-off between them based on their importance. This paper introduces CARE, an explanation framework that addresses these challenges by formulating the properties as intuitive and computationally efficient objective functions, organized in a modular hierarchy and optimized using a multi-objective optimization algorithm. The devised modular hierarchy enables the arbitration and aggregation of various properties as well as the generation of CEs and AR by choice. CARE involves individuals through a flexible language for defining preferences, facilitates their choice by recommending multiple ARs, and guides their action steps toward their desired outcome. CARE is a model-agnostic approach for explaining any multi-class classification and regression model in mixed-feature tabular settings. We demonstrate the efficacy of our framework through several validation and benchmark experiments on standard data sets and black box models.
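CARE itself formulates the stated properties as objective functions and solves them with a multi-objective optimizer; that machinery is not reproduced here. As a hedged illustration of the underlying idea only, the sketch below searches for a counterfactual by randomly perturbing a tabular instance and keeping the candidate that flips the model's prediction while staying closest (L1 proximity) to the original input; the data, model, and function names are invented.

```python
# Minimal illustration of counterfactual search: perturb an instance until the
# classifier's prediction flips, preferring candidates close to the original
# (L1 proximity). This is a toy stand-in, not the CARE framework.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
rng = np.random.default_rng(0)

def counterfactual(x, target, n_trials=5000, scale=1.0):
    """Return the closest random perturbation of x that the model labels `target`."""
    best, best_dist = None, np.inf
    for _ in range(n_trials):
        candidate = x + rng.normal(0.0, scale, size=x.shape)
        if model.predict(candidate.reshape(1, -1))[0] == target:
            dist = np.abs(candidate - x).sum()       # L1 proximity objective
            if dist < best_dist:
                best, best_dist = candidate, dist
    return best, best_dist

x0 = X[0]
desired = 1 - model.predict(x0.reshape(1, -1))[0]    # the alternative outcome
cf, dist = counterfactual(x0, desired)
print("found counterfactual" if cf is not None else "no counterfactual", "L1 distance:", dist)
```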
Resampling-based predictive simulation framework of stochastic diffusion model for identifying top-K influential nodes
International Journal of Data Science and Analytics - Volume 9 - Pages 175-195 - 2019
We address the problem of efficiently estimating the influence of a node in information diffusion over a social network. Since information diffusion is a stochastic process, the influence degree of a node is quantified by an expectation, which is usually obtained through many time-consuming simulation runs. Our contribution is a predictive simulation framework based on the leave-N-out cross-validation technique that closely approximates the error from the unknown ground truth for two target problems: estimating the influence degree of each node, and identifying the top-K influential nodes. The method proposed for the first problem estimates the approximation error of each node's influence degree, and the method for the second estimates the precision of the derived top-K nodes, both without knowing the true influence degrees. We experimentally evaluate the proposed methods on three real-world networks and show that, with leave-half-out cross-validation (i.e., N is half the number of runs), they solve the target problems with far fewer simulation runs while ensuring accuracy; in other words, one can identify influential nodes with good accuracy without knowing their exact influence degrees.
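The framework estimates, from the simulation runs themselves, how far a Monte Carlo influence estimate is from the unknown ground truth. The sketch below is a hedged illustration rather than the authors' method: it runs independent cascade (IC) simulations from a seed node and uses a leave-half-out split of the runs to gauge the stability of the estimated influence degree; the graph and parameters are made up.

```python
# Hedged illustration: Monte Carlo estimation of a node's influence under the
# independent cascade (IC) model, with a leave-half-out split of the runs used
# to gauge how stable the estimate is. Graph and parameters are made up.
import random
import networkx as nx

G = nx.barabasi_albert_graph(500, 3, seed=0)   # synthetic stand-in for a social network
P = 0.05                                       # IC propagation probability per edge

def ic_spread(graph, seed, p):
    """One stochastic IC cascade from `seed`; returns the number of activated nodes."""
    active, frontier = {seed}, [seed]
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.neighbors(u):
                if v not in active and random.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

random.seed(0)
runs = [ic_spread(G, seed=0, p=P) for _ in range(2000)]

# Leave-half-out: compare the estimates from two disjoint halves of the runs.
half = len(runs) // 2
est_a = sum(runs[:half]) / half
est_b = sum(runs[half:]) / half
print(f"influence estimate: {sum(runs)/len(runs):.2f}, half-vs-half gap: {abs(est_a - est_b):.2f}")
```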
Speeding-up node influence computation for huge social networks
International Journal of Data Science and Analytics - Volume 1 - Pages 3-16 - 2016
We address the problem of efficiently estimating the influence degree of all nodes in a network simultaneously under the SIR setting. The proposed approach further improves on existing work on the bond percolation process, which was shown to be very effective, i.e., three orders of magnitude faster than direct Monte Carlo simulation, in approximately solving the influence maximization problem. We introduce two pruning techniques that improve computational efficiency by an order of magnitude. The approach is generic and can be instantiated with any specific diffusion model, and it does not require the approximations or assumptions needed in existing approaches. We demonstrate its effectiveness through extensive experiments on two large real social networks. Our main findings are that different network structures have different epidemic thresholds and that node influence can identify influential nodes that existing centrality measures cannot. We also analyze how performance changes when the network structure is systematically varied using synthetically generated networks and identify the important factors that affect performance.
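Under the SIR/IC setting, bond percolation estimates the influence of all nodes at once: each edge is kept independently with the transmission probability, and within each sampled "live-edge" graph every node in the same connected component is credited with that component's size. The sketch below is a hedged, undirected illustration of that idea, without the paper's pruning techniques; the graph and parameters are assumptions.

```python
# Hedged illustration of bond percolation: sample live-edge graphs by keeping
# each edge with probability p, then credit every node with the size of its
# connected component. Averaging over samples estimates all influences at once.
# (The paper's pruning techniques are not reproduced here.)
import random
import networkx as nx

G = nx.barabasi_albert_graph(1000, 3, seed=0)
P, N_SAMPLES = 0.05, 200
influence = {v: 0.0 for v in G}

random.seed(0)
for _ in range(N_SAMPLES):
    kept = [(u, v) for u, v in G.edges() if random.random() < P]
    H = nx.Graph()
    H.add_nodes_from(G)
    H.add_edges_from(kept)
    for comp in nx.connected_components(H):   # one pass credits every node in the component
        size = len(comp)
        for v in comp:
            influence[v] += size / N_SAMPLES

top = sorted(influence, key=influence.get, reverse=True)[:5]
print("top-5 influential nodes:", top)
```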
Prescriptive analytics with differential privacy
International Journal of Data Science and Analytics - Volume 13 - Pages 123-138 - 2021
Prescriptive analytics is a mechanism that provides the best set of actions to take to prevent an undesirable outcome for a given instance. However, this mechanism is prone to privacy breaches if an adversary with auxiliary data is allowed multiple query access to it. We therefore propose a differential privacy mechanism for prescriptive analytics to preserve data privacy. Differential privacy can be achieved with the help of the sensitivity of the given actions; roughly speaking, sensitivity is the maximum change in the given set of actions with respect to a change in the given instances. However, a general analytical form for the sensitivity of the prescriptive analytics mechanism is difficult to derive, so we formulate a nested constrained optimization to solve the problem. We use synthetic data in the experiments to validate the behavior of the differential privacy mechanism under different privacy parameter settings. Experiments with two real-world datasets (Student Academic Performance and Reddit) demonstrate the usefulness of the proposed method in education and social policy design. We also propose a new evaluation measure, the prescription success rate, to further investigate the significance of the proposed method.
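Once the sensitivity of the prescriptive output has been bounded (which the paper obtains by solving a nested constrained optimization), differential privacy can be enforced with a standard additive-noise mechanism. The snippet below is a generic Laplace-mechanism sketch, not the paper's procedure; the sensitivity value and the released quantity are placeholders.

```python
# Generic Laplace mechanism: add noise scaled to sensitivity / epsilon to a
# numeric output. The sensitivity value here is a placeholder; in the paper it
# is obtained by solving a nested constrained optimization.
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(value, sensitivity, epsilon):
    """Return an epsilon-differentially-private release of `value`."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

true_action_magnitude = 4.2   # e.g., a prescribed change to one feature (placeholder)
sensitivity = 1.5             # placeholder bound on how much the action can change
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_action_magnitude, sensitivity, eps)
    print(f"epsilon={eps:>4}: released value {noisy:.2f}")
```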
Optimizing network lifespan through energy harvesting in low-power lossy wireless networks
International Journal of Data Science and Analytics - Pages 1-15 - 2023
The performance of a low-power lossy network depends strongly on the battery life of each wireless node, and the routing protocols used in such networks do not adequately address the energy problem each node faces. Novel techniques are therefore needed to handle the nodes' energy constraints and extend the overall lifespan of every node in the network. This study focuses on improving the lifespan of energy-constrained nodes in low-power lossy networks by adding a solar energy harvesting module. Using the Cooja simulator with the ETX and OF0 objective functions, we analyze the energy harvested by nodes starting at 1% battery and evaluate network performance with 25, 50, and 100 nodes in terms of throughput, packet delivery ratio, and network connectivity. The results show that OF0 outperforms ETX in all three scenarios and also performs well on the other metrics. With OF0, the network lasts 12:50:58, 12:46:58, and 8:12:25 (hh:mm:ss) without energy harvesting, and 15:59:01, 15:12:07, and 11:18:23 with energy harvesting, for 25, 50, and 100 nodes, respectively. The study suggests that techniques such as energy harvesting can significantly improve the overall lifespan of nodes in low-power lossy networks.
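As a back-of-the-envelope illustration of why harvesting extends lifetime, the toy model below depletes a node's battery at a fixed consumption rate and optionally adds a periodic solar inflow. All rates and capacities are invented and unrelated to the Cooja/Contiki experiments reported in the paper.

```python
# Toy battery model: constant consumption, optional periodic solar inflow.
# All numbers are invented; this only illustrates why harvesting extends lifetime.
def lifetime_hours(capacity_mah, draw_ma, harvest_ma=0.0, daylight_fraction=0.5, step_h=0.1):
    charge, t = capacity_mah, 0.0
    while charge > 0 and t < 10_000:
        inflow = harvest_ma if (t % 24) < 24 * daylight_fraction else 0.0  # crude day/night cycle
        charge += (inflow - draw_ma) * step_h
        charge = min(charge, capacity_mah)   # battery cannot exceed capacity
        t += step_h
    return t

print(f"no harvesting  : {lifetime_hours(100, draw_ma=8):.1f} h")
print(f"with harvesting: {lifetime_hours(100, draw_ma=8, harvest_ma=5):.1f} h")
```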
Re-interpreting rules interpretability
International Journal of Data Science and Analytics - Pages 1-21 - 2023
Trustworthy machine learning requires a high level of interpretability of machine learning models, yet many models are inherently black-boxes. Training interpretable models instead—or using them to mimic the black-box model—seems like a viable solution. In practice, however, these interpretable models are still unintelligible due to their size and complexity. In this paper, we present an approach to explain the logic of large interpretable models that can be represented as sets of logical rules by a simple, and thus intelligible, descriptive model. The coarseness of this descriptive model and its fidelity to the original model can be controlled, so that a user can understand the original model in varying levels of depth. We showcase and discuss this approach on three real-world problems from healthcare, material science, and finance.
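The paper's descriptive model summarizes a large rule set at a controllable level of coarseness; that method is not reproduced here. As a hedged, much simpler illustration of summarizing a rule set, the sketch below counts which conditions occur most often across a rule list, yielding a crude coarse description of the model's logic; the rule format and data are invented.

```python
# Hedged, much simpler stand-in for rule summarization: count how often each
# condition appears across a rule set, as a crude coarse description of the
# model's logic. The rule format here is invented for illustration.
from collections import Counter

rules = [  # each rule: (list of conditions, predicted class)
    (["age>50", "bmi>30"], "high_risk"),
    (["age>50", "smoker=yes"], "high_risk"),
    (["bmi>30", "smoker=yes"], "high_risk"),
    (["age<=50", "bmi<=30"], "low_risk"),
]

condition_counts = Counter(cond for conds, _ in rules for cond in conds)
class_counts = Counter(label for _, label in rules)

print("most common conditions:", condition_counts.most_common(3))
print("rules per class       :", dict(class_counts))
```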
On the discovery of spatial-temporal fluctuating patterns
International Journal of Data Science and Analytics - Volume 8 - Pages 57-75 - 2018
In this paper, we explore a new mining paradigm, called spatial-temporal fluctuating patterns (abbreviated as STFs), to discover potentially fluctuating and useful feature sets from spatial-temporal data. These feature sets have properties that vary as time advances. Once STFs are discovered, we can find the turning points of patterns, which enables anomaly detection and the discovery of transformations over time. For example, discovering STFs can reveal virus variation during an epidemic outbreak, providing the government with clues for epidemic control. We therefore develop a union-based mining method with a downward-closure structure to speed up the spatial-temporal mining process and dynamically compute fluctuating patterns. Our experimental studies show that the proposed framework can efficiently discover STFs on a real epidemic disease dataset, demonstrating its advantages for real applications.
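The union-based mining framework and its downward-closure structure are not reproduced here. As a hedged sketch of the underlying idea only, the code below computes the per-time-window support of each feature over toy spatial-temporal records and flags features whose support fluctuates strongly between consecutive windows; the data, feature names, and threshold are invented.

```python
# Hedged sketch: compute per-time-window support of each feature over toy
# spatial-temporal records and flag features whose support fluctuates strongly
# between consecutive windows. Data and thresholds are invented.
from collections import defaultdict

# records: (time_window, region, set of observed features)
records = [
    (0, "north", {"strainA"}), (0, "south", {"strainA"}),
    (1, "north", {"strainA", "strainB"}), (1, "south", {"strainB"}),
    (2, "north", {"strainB"}), (2, "south", {"strainB"}),
]

support = defaultdict(lambda: defaultdict(int))   # feature -> window -> count
window_sizes = defaultdict(int)
for t, _, feats in records:
    window_sizes[t] += 1
    for f in feats:
        support[f][t] += 1

FLUCTUATION_THRESHOLD = 0.4
windows = sorted(window_sizes)
for feature, counts in support.items():
    freqs = [counts.get(t, 0) / window_sizes[t] for t in windows]
    jumps = [abs(b - a) for a, b in zip(freqs, freqs[1:])]
    if max(jumps) >= FLUCTUATION_THRESHOLD:
        print(f"fluctuating feature: {feature}, per-window support {freqs}")
```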
Total: 360 publications