Empirical Software Engineering

Công bố khoa học tiêu biểu

Sắp xếp:  
An exploratory study of api changes and usages based on apache and eclipse ecosystems
Empirical Software Engineering - Tập 21 - Trang 2366-2412 - 2015
Wei Wu, Foutse Khomh, Bram Adams, Yann-Gaël Guéhéneuc, Giuliano Antoniol
Frameworks are widely used in modern software development to reduce development costs. They are accessed through their Application Programming Interfaces (APIs), which specify the contracts with client programs. When frameworks evolve, API backward-compatibility cannot always be guaranteed and client programs must upgrade to use the new releases. Because framework upgrades are not cost-free, observing API changes and usages together at fine-grained levels is necessary to help developers understand, assess, and forecast the cost of each framework upgrade. Whereas previous work studied API changes in frameworks and API usages in client programs separately, we analyse and classify API changes and usages together in 22 framework releases from the Apache and Eclipse ecosystems and their client programs. We find that (1) missing classes and methods happen more often in frameworks and affect client programs more often than the other API change types do, (2) missing interfaces occur rarely in frameworks but affect client programs often, (3) framework APIs are used on average in 35 % of client classes and interfaces, (4) most of such usages could be encapsulated locally and reduced in number, and (5) about 11 % of APIs usages could cause ripple effects in client programs when these APIs change. Based on these findings, we provide suggestions for developers and researchers to reduce the impact of API evolution through language mechanisms and design strategies.
Variational satisfiability solving: efficiently solving lots of related SAT problems
Empirical Software Engineering - Tập 28 Số 1 - 2023
Jeffrey M. Young, Paul Maximilian Bittner, Eric Walkingshaw, Thomas Thüm
AbstractIncremental satisfiability (SAT) solving is an extension of classic SAT solving that enables solving a set of related SAT problems by identifying and exploiting shared terms. However, using incremental solvers effectively is hard since performance is sensitive to the input order of subterms and results must be tracked manually. For analyses that generate sets of related SAT problems, such as those in software product lines, incremental solvers are either not used or their use is not clearly described in the literature. This paper translates the ordering problem to an encoding problem and automates the use of incremental solving. We introduce variational SAT solving, which differs from incremental solving by accepting all related problems as a single variational input and returning all results as a single variational output. Variational solving syntactically encodes differences in related SAT problems as local points of variation. With this syntax, our approach automates the interaction with the incremental solver and enables a method to automatically optimize sharing in the input. To evaluate these ideas, we formalize a variational SAT algorithm, construct a prototype variational solver, and perform an empirical analysis on two real-world datasets that applied incremental solvers to software evolution scenarios. We show, assuming a variational input, that the prototype solver scales better for these problems than four off-the-shelf incremental solvers while also automatically tracking individual results.
Bugs in machine learning-based systems: a faultload benchmark
Empirical Software Engineering - Tập 28 - Trang 1-33 - 2023
Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, Zhen Ming (Jack) Jiang
The rapid escalation of applying Machine Learning (ML) in various domains has led to paying more attention to the quality of ML components. There is then a growth of techniques and tools aiming at improving the quality of ML components and integrating them into the ML-based system safely. Although most of these tools use bugs’ lifecycle, there is no standard benchmark of bugs to assess their performance, compare them and discuss their advantages and weaknesses. In this study, we firstly investigate the reproducibility and verifiability of the bugs in ML-based systems and show the most important factors in each one. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark namely defect4ML that satisfies all criteria of standard benchmark, i.e. relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 100 bugs reported by ML developers in GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, like: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs’ origins. defect4ML can be of interest to ML-based systems practitioners and researchers to assess their testing tools and techniques.
To what extent do DNN-based image classification models make unreliable inferences?
Empirical Software Engineering - Tập 26 - Trang 1-40 - 2021
Yongqiang Tian, Shiqing Ma, Ming Wen, Yepang Liu, Shing-Chi Cheung, Xiangyu Zhang
Deep Neural Network (DNN) models are widely used for image classification. While they offer high performance in terms of accuracy, researchers are concerned about if these models inappropriately make inferences using features irrelevant to the target object in a given image. To address this concern, we propose a metamorphic testing approach that assesses if a given inference is made based on irrelevant features. Specifically, we propose two metamorphic relations (MRs) to detect such unreliable inferences. These relations expect (a) the classification results with different labels or the same labels but less certainty from models after corrupting the relevant features of images, and (b) the classification results with the same labels after corrupting irrelevant features. The inferences that violate the metamorphic relations are regarded as unreliable inferences. Our evaluation demonstrated that our approach can effectively identify unreliable inferences for single-label classification models with an average precision of 64.1% and 96.4% for the two MRs, respectively. As for multi-label classification models, the corresponding precision for MR-1 and MR-2 is 78.2% and 86.5%, respectively. Further, we conducted an empirical study to understand the problem of unreliable inferences in practice. Specifically, we applied our approach to 18 pre-trained single-label image classification models and 3 multi-label classification models, and then examined their inferences on the ImageNet and COCO datasets. We found that unreliable inferences are pervasive. Specifically, for each model, more than thousands of correct classifications are actually made using irrelevant features. Next, we investigated the effect of such pervasive unreliable inferences, and found that they can cause significant degradation of a model’s overall accuracy. After including these unreliable inferences from the test set, the model’s accuracy can be significantly changed. Therefore, we recommend that developers should pay more attention to these unreliable inferences during the model evaluations. We also explored the correlation between model accuracy and the size of unreliable inferences. We found the inferences of the input with smaller objects are easier to be unreliable. Lastly, we found that the current model training methodologies can guide the models to learn object-relevant features to certain extent, but may not necessarily prevent the model from making unreliable inferences. We encourage the community to propose more effective training methodologies to address this issue.
Empirical studies in software and systems traceability
Empirical Software Engineering - Tập 22 - Trang 963-966 - 2017
Patrick Mäder, Rocco Olivetto, Andrian Marcus
Introduction to the special issue on software analysis, evolution, and reengineering
Empirical Software Engineering - Tập 24 - Trang 329-331 - 2019
Gabriele Bavota, Andrian Marcus
Refactoring with domain-driven design in an industrial context
Empirical Software Engineering - Tập 28 - Trang 1-36 - 2023
Ozan Özkan, Önder Babur, Mark van den Brand
Software developers need to constantly work on evolving the structure and the stability of the code due to changing business needs of the product. There are various refactoring approaches in industry which promise improvements over source code composition and maintainability. In our research, we want to improve the maintainability of an existing system through refactoring using Domain-Driven Design (DDD) as a software design approach. We also aim for providing empirical evidence on its effect on maintainability and the challenges as perceived by developers. In this study, we applied the action research methodology, which facilitates close academia-industry collaboration and regular presence in the studied product. We utilized focus groups to discover problems of the existing system with a qualitative approach. We reviewed the subject codebase to construct our own expert opinion as well and identified problems in the codebase and matched them with the ones raised by engineers in the team. We refactored the existing software system according to DDD principles. To measure the effects of our actions, we utilized Technology Acceptance Model (mTAM) questionnaire, and also semi-structured interviews with the development team for data collection, and card sorting methodology for qualitative analysis. For minimizing bias that might affect our results with the existing software engineers in the team, we extended our measurement with three new joiner software engineers in the team through the think aloud protocol. We have identified that engineers mostly gave positive answers to our interview questions, which are mapped to software maintainability metrics defined by ISO/IEC 25010. Our DDD refactoring scored 85 in PU and 83 in PEU, leading to an overall mTAM score of 84. This means acceptable on the acceptability scale, B on the grade scale, and good on the adjective rating scale. Our research led us to conclude that a powerful design approach, like DDD, is an effective tool for restructuring and resolving software issues in this situation. It offers standardization to the software and the refactoring efforts. We realized that DDD entails a certain degree of complexity and cognitive load, which is a barrier for software engineers, but they are aware of its benefits.
Rap4DQ: Learning to recommend relevant API documentation for developer questions
Empirical Software Engineering - - 2022
Yi Li, Shaohua Wang, Wenbo Wang, Tien N. Nguyen, Yan Wang, Xinyue Ye
A graph-based code representation method to improve code readability classification
Empirical Software Engineering - Tập 28 - Trang 1-26 - 2023
Qing Mi, Yi Zhan, Han Weng, Qinghang Bao, Longjie Cui, Wei Ma
Code readability is crucial for developers since it is closely related to code maintenance and affects developers’ work efficiency. Code readability classification refers to the source code being classified as pre-defined certain levels according to its readability. So far, many code readability classification models have been proposed in existing studies, including deep learning networks that have achieved relatively high accuracy and good performance. However, in terms of representation, these methods lack effective preservation of the syntactic and semantic structure of the source code. To extract these features, we propose a graph-based code representation method. Firstly, the source code is parsed into a graph containing its abstract syntax tree (AST) combined with control and data flow edges to reserve the semantic structural information and then we convert the graph nodes’ source code and type information into vectors. Finally, we train our graph neural networks model composing Graph Convolutional Network (GCN), DMoNPooling, and K-dimensional Graph Neural Networks (k-GNNs) layers to extract these features from the program graph. We evaluate our approach to the task of code readability classification using a Java dataset provided by Scalabrino et al. (2016). The results show that our method achieves 72.5% and 88% in three-class and two-class classification accuracy, respectively. We are the first to introduce graph-based representation into code readability classification. Our method outperforms state-of-the-art readability models, which suggests that the graph-based code representation method is effective in extracting syntactic and semantic information from source code, and ultimately improves code readability classification.
Eliciting user requirements using Appreciative inquiry
Empirical Software Engineering - Tập 16 - Trang 733-772 - 2011
Carol K. Gonzales, Gondy Leroy
For decades and still today, software development projects have failed because they do not meet the needs of users, are over-budget, and abandoned. To help address this problem, the user requirements elicitation process was modified based on principles of Appreciative Inquiry. Appreciative Inquiry, commonly used in organizational development, aims to build organizations, processes, or systems based on success stories using a hopeful vision for an ideal future. In four studies, Appreciative Inquiry was evaluated for its effectiveness with eliciting user requirements in different circumstances. In the first two studies, it was compared with brainstorming, a traditional approach, with end-users (study 1) and proxy-users (study 2). The third study was a quasi-experiment comparing the use of Appreciative Inquiry in different phases in the software development cycle (study 3). The final (fourth) study combined all lessons learned using Appreciative Inquiry in a cross-case comparison study of 3 cases to gain additional understanding of the requirements gathered during various project phases. In each of the four studies, the requirements gathered, developer and user attitudes, and the Appreciative Inquiry process itself were evaluated. Requirements were evaluated for their quantity and type regardless of whether they were implemented or not. Attitudes were evaluated for commitment to the requirements and project using process feedback. The Appreciative Inquiry process was evaluated with differing groups, projects, and project phases to determine how and when it is best applied. Potentially interceding factors were also evaluated including: team effectiveness, Emotional Intelligence, and perceived stress. Appreciative Inquiry produced positive results for the participants, the requirements obtained, and the general requirements eliciting-process. Appreciative Inquiry demonstrated benefits to the requirements gathered by increasing the number of unique requirements as well as identifying more quality-based (non-functional) and forward-looking requirements. It worked best when there was time for participants to reflect on the thought-provoking questions and when the facilitator was knowledgeable of the subject-matter and had extra time to extract and translate the requirements. The participants (end-users and developers) expressed improved project understanding. End-users participated consistently with immediate buy-in and enthusiasm, especially those users who were technically-inhibited. We conclude that Appreciative Inquiry can augment existing methods by presenting a positive and future aspect for a proposed system resulting in improved user requirements.
Tổng số: 1,119   
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 112