Seml: A Semantic LSTM Model for Software Defect Prediction
Top 1% of 2019 papers
Abstract
Software defect prediction can assist developers in finding potential bugs and reducing maintenance costs. Traditional approaches usually utilize software metrics (Lines of Code, Cyclomatic Complexity, etc.) as features to build classifiers and identify defective software modules. However, software metrics often fail to capture programs' syntactic and semantic information. In this paper, we propose Seml, a novel framework that combines word embedding and deep learning methods for defect prediction. Specifically, for each program source file, we first extract a token sequence from its abstract syntax tree. Then, we map each token in the sequence to a real-valued vector using a mapping table, which is trained with an unsupervised word embedding model. Finally, we use the vector sequences and their labels (defective or non-defective) to build a Long Short-Term Memory (LSTM) network. The LSTM model can automatically learn the semantic information of programs and perform defect prediction. The evaluation results on eight open-source projects show that Seml outperforms three state-of-the-art defect prediction approaches on most of the datasets for both within-project and cross-project defect prediction.
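The first step of the pipeline described above (extracting a token sequence from a source file's abstract syntax tree) can be sketched as follows. Seml operates on Java ASTs; this minimal sketch uses Python's standard-library `ast` module purely to illustrate the idea, and the tokenization scheme (node-type names in traversal order) is an assumption, not the paper's exact rule.

```python
import ast

def extract_token_sequence(source: str) -> list[str]:
    """Parse source code and emit AST node-type names in traversal order.

    This stands in for Seml's AST-token extraction step; the actual
    framework parses Java files and uses its own token vocabulary.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

tokens = extract_token_sequence("def add(a, b):\n    return a + b")
# The sequence starts at the root node and includes the function definition.
```

Each such token would then be looked up in the trained embedding table to produce the real-valued vector sequence fed to the LSTM.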
Related Papers
- Generalized vulnerability extrapolation using abstract syntax trees (2012), 228 citations
- Source Code Plagiarism Detection Based on Abstract Syntax Tree Fingerprintings (2022), 6 citations
- A tool for detecting dependency violation of layered architecture in source code (2014), 2 citations
- A Token Oriented Measurement Method of Source Code Similarity (2014), 1 citation
- Dynamic Syntax Tree Model for Enhanced Source Code Representation (2023), 1 citation