Artificial Intelligence and Software Engineering

Classifying Code Commits with Convolutional Neural Networks (IJCNN20)

Developers change software programs for various purposes (e.g., bug fixes, feature additions, and code refactorings), but the intents of code changes are often unrecorded or poorly documented. To automatically infer the change intent of each program commit (i.e., a set of code changes), existing work classifies commits based on commit messages and/or the sheer counts of edited files, lines, or abstract syntax tree (AST) nodes. However, none of these tools reasons about the syntactic or semantic dependencies between co-applied changes, nor do they adopt any deep learning method. To better characterize program commits, in this paper, we present CClassifier, a new approach that classifies commits by (1) using advanced static program analysis to comprehend the relationships between co-applied edits, (2) representing edits and their relationships via graphs, and (3) applying convolutional neural networks (CNNs) to classify those graphs.
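As a rough illustration of step (2), the Python sketch below encodes a commit's edits as graph nodes and the dependencies between them as a 0/1 adjacency matrix, a grid-like form that a CNN could take as input. The `Edit` schema, the entity names, and the single unlabeled dependency type are assumptions made for this example; the abstract does not specify CClassifier's actual graph encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """One edited program entity in a commit (hypothetical schema)."""
    entity: str  # e.g. a method or field name
    kind: str    # "added", "deleted", or "changed"


def build_adjacency(edits, dependencies):
    """Return an N x N 0/1 matrix; cell [i][j] is 1 when edit i depends on edit j."""
    index = {edit.entity: i for i, edit in enumerate(edits)}
    matrix = [[0] * len(edits) for _ in edits]
    for source, target in dependencies:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical commit: parse() now calls the new validate(),
# and validate() reads the new field strict.
edits = [
    Edit("Parser.parse", "changed"),
    Edit("Parser.validate", "added"),
    Edit("Parser.strict", "added"),
]
dependencies = [
    ("Parser.parse", "Parser.validate"),
    ("Parser.validate", "Parser.strict"),
]
adjacency = build_adjacency(edits, dependencies)
```

A count-based classifier would see only "one changed method, one added method, one added field" here; the matrix additionally records that the three edits form a dependency chain.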

Compared with prior work, CClassifier extracts a richer set of features from program changes, and it is the first approach to classify program commits using a CNN. For evaluation, we prepared a benchmark that contains 7,414 code changes from 5 open-source Java projects. On this benchmark, we empirically compared CClassifier and the state-of-the-art approach with five-fold cross validation. On average, when predicting bug-fixing commits within the same projects, CClassifier improved the prediction accuracy from 70% to 72%. More importantly, prior work seldom identifies feature-addition commits, whereas CClassifier successfully identifies such commits in many more scenarios. Our evaluation shows that CClassifier outperforms prior work due to its use of advanced program analysis and CNNs.
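For readers unfamiliar with five-fold cross validation, the following sketch shows one way to partition a data set of 7,414 samples into five train/test splits. The function name `k_fold_indices` is hypothetical, and the splits are contiguous and unshuffled for simplicity; the abstract does not describe how the folds were actually drawn.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k (train, test) index pairs with near-equal test folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(7414, k=5)
test_sizes = [len(test) for _, test in folds]
```

Each sample is used for testing exactly once and for training in the other four rounds; a per-fold metric such as accuracy would then be averaged over the five rounds.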

SAIS: Self-Adaptive Identification of Security Bug Reports (TDSC19)

Among various bug reports (BRs), security bug reports (SBRs) are unique because they require immediate concealment and fixes. When SBRs are not identified in time, attackers can exploit the vulnerabilities. Prior work identifies SBRs via text mining, which requires a predefined keyword list and trains a classifier with known SBRs and non-security bug reports (NSBRs). This approach is not reliable, because (1) as the context of security vulnerabilities and the terminology of SBRs change over time, the predefined list will become outdated; and (2) users may have insufficient SBRs for training. This paper introduces a semi-supervised learning-based approach, SAIS, to adaptively and reliably identify SBRs. Given a project's BRs containing some labeled SBRs, many more NSBRs, and unlabeled BRs, SAIS iteratively mines keywords, trains a classifier based on the keywords from the labeled data, classifies unlabeled BRs, and augments its training data with the newly labeled BRs. Our evaluation shows that SAIS is useful for identifying SBRs.
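The iterative loop described above can be sketched as a small self-training routine in Python. This is a simplified illustration, not the SAIS implementation: keyword mining is reduced to a document-frequency difference, the trained classifier is replaced by a keyword-hit threshold, and the names `mine_keywords`, `keyword_hits`, `self_train`, `top_k`, and `min_hits` are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def mine_keywords(sbrs, nsbrs, top_k=5):
    """Pick words that occur in a larger share of labeled SBRs than labeled NSBRs.

    Assumes both lists are non-empty.
    """
    sbr_counts = Counter(w for r in sbrs for w in set(tokenize(r)))
    nsbr_counts = Counter(w for r in nsbrs for w in set(tokenize(r)))
    scores = {w: sbr_counts[w] / len(sbrs) - nsbr_counts[w] / len(nsbrs)
              for w in sbr_counts}
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return {w for w in ranked[:top_k] if scores[w] > 0}


def keyword_hits(report, keywords):
    return len(keywords & set(tokenize(report)))


def self_train(sbrs, nsbrs, unlabeled, rounds=3, min_hits=2):
    """Re-mine keywords each round and move confidently labeled reports into the training data."""
    sbrs, nsbrs, pending = list(sbrs), list(nsbrs), list(unlabeled)
    for _ in range(rounds):
        keywords = mine_keywords(sbrs, nsbrs)
        still_pending, moved = [], False
        for report in pending:
            hits = keyword_hits(report, keywords)
            if hits >= min_hits:      # stand-in for a confident SBR prediction
                sbrs.append(report)
                moved = True
            elif hits == 0:           # stand-in for a confident NSBR prediction
                nsbrs.append(report)
                moved = True
            else:
                still_pending.append(report)
        pending = still_pending
        if not moved:                 # nothing new was labeled, so stop early
            break
    return sbrs, nsbrs, pending


# Made-up bug report titles, for illustration only.
labeled_sbrs = ["buffer overflow in parser", "overflow allows remote exploit"]
labeled_nsbrs = ["button color is wrong", "typo in help page",
                 "crash in parser on empty file"]
unlabeled_brs = ["remote overflow exploit in decoder", "wrong color in help dialog"]

final_sbrs, final_nsbrs, remaining = self_train(labeled_sbrs, labeled_nsbrs, unlabeled_brs)
```

Because the keyword set is recomputed from the growing labeled data in every round, it can drift along with the project's vocabulary instead of staying fixed, which is the motivation the abstract gives for avoiding a predefined list.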

Automatic Clone Recommendation for Refactoring Based on the Present and the Past (ICSME18)

When many clones are detected in software programs, not all clones are equally important to developers. To help developers refactor code and improve software quality, various tools were built to recommend clone-removal refactorings based on past and present information, such as the cohesion degree of individual clones or the co-evolution relations of clone peers. The existence of these tools inspired us to build an approach that considers as many factors as possible to recommend clones more accurately. This paper introduces CRec, a learning-based approach that recommends clones by extracting features from the current status and past history of software projects. Given a set of software repositories, CRec first automatically extracts the clone groups historically refactored (R-clones) and those not refactored (NR-clones) to construct the training set. CRec extracts 34 features to characterize the content and evolution behaviors of individual clones, as well as the spatial, syntactical, and co-change relations of clone peers. With these features, CRec trains a classifier that recommends clones for refactoring.

We designed the largest feature set thus far for clone recommendation, and performed an evaluation on six large projects. The results show that our approach suggested refactorings with 83% and 76% F-scores in the within-project and cross-project settings, respectively. CRec significantly outperforms a similar state-of-the-art approach on our data set, which achieved 70% and 50% F-scores in the same two settings. We also compared the effectiveness of different factors and different learning algorithms.
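As a reminder of what an F-score summarizes, the sketch below computes it from raw prediction counts, assuming the common balanced form (F1, the harmonic mean of precision and recall). The abstract does not state which F-score variant was used, and the counts in the example are made up.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical outcome: 8 clone groups correctly recommended,
# 2 recommended wrongly, 2 refactored groups missed.
example = f_score(true_positives=8, false_positives=2, false_negatives=2)
```

In this made-up case precision and recall are both 0.8, so the F-score is also 0.8; when the two differ, the harmonic mean is pulled toward the lower value.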

CCLearner: A Deep Learning-Based Clone Detection Approach (ICSME17)

Programmers produce code clones when developing software. By copying and pasting code with or without modification, developers reuse existing code to improve programming productivity. However, code clones present challenges to software maintenance: they may require consistent application of the same or similar bug fixes or program changes to multiple code locations. To simplify the maintenance process, various tools have been proposed to automatically detect clones. Some tools tokenize source code and then compare the sequence or frequency of tokens to reveal clones. Other tools detect clones using tree-matching algorithms to compare the Abstract Syntax Trees (ASTs) of source code. In this paper, we present CCLearner, the first solely token-based clone detection approach leveraging deep learning. CCLearner extracts tokens from known method-level code clones and non-clones to train a classifier, and then uses the classifier to detect clones in a given codebase.
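To make the idea of comparing token frequencies concrete, here is a minimal Python sketch that tokenizes two code snippets and scores how closely their token counts match. The regular-expression tokenizer and the similarity formula are assumptions chosen for illustration; the abstract does not give CCLearner's actual features or network design, and a learned classifier would consume scores like this rather than apply a fixed cutoff.

```python
import re
from collections import Counter

# Identifiers/keywords, integer literals, or any single other non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def token_counts(code):
    return Counter(TOKEN_PATTERN.findall(code))


def frequency_similarity(code_a, code_b):
    """Return 1.0 for identical token frequencies and 0.0 for no shared tokens."""
    a, b = token_counts(code_a), token_counts(code_b)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0
    difference = sum(abs(a[t] - b[t]) for t in a.keys() | b.keys())
    return 1.0 - difference / total


# Two snippets that differ only in one variable name.
similarity = frequency_similarity("int a = 1;", "int b = 1;")
```

The two snippets share four of their five tokens each, so the score is 0.8. Computing such a score per token category (for example, identifiers versus operators) would yield a small feature vector for each method pair.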

To evaluate CCLearner, we reused BigCloneBench, an existing large benchmark of real clones. We used part of the benchmark for training and the other part for testing. We observed that CCLearner effectively detected clones. With the same data set, we conducted a systematic comparison experiment between CCLearner and three popular clone detection tools. Compared with the approaches not using deep learning, CCLearner achieved competitive clone detection effectiveness with low time cost.