A Novel Fine-Grained Source Code Vulnerability Detection Model via Joint Token and Statement Representation Learning
Abstract
With the growing volume of code and increasing complexity of software systems, defects in source code can lead to serious security risks, such as malicious intrusions, data breaches, loss of availability, and erroneous scientific computation results, making their detection crucial. Mainstream code vulnerability detection methods currently fall into two categories: graph neural network (GNN)-based methods and sequence-based methods. Both categories have achieved considerable success in this field; however, each suffers from certain shortcomings. Graph-based methods typically face substantial memory overhead for graph construction, over-smoothing, and incomplete utilization of heterogeneous edge information. Sequence-based methods generally treat code as plain text, learning only token-level features while ignoring the code's structural information, which yields suboptimal detection performance; moreover, they rarely support line-level vulnerability detection. To address these issues, this paper proposes a novel sequence-based detection method that jointly learns token-level and statement-level feature representations and supports line-level detection, thereby significantly improving detection capability. On a public dataset, the proposed method achieves an F1 score of 92.71% for function-level detection and a top-5 accuracy of 61% for line-level vulnerability detection.
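To make the idea of joint token- and statement-level representation concrete, the following is a minimal NumPy sketch, not the paper's actual architecture: token embeddings are mean-pooled into one vector per statement, and a scorer over statement vectors then yields per-line rankings (for top-k line-level detection) alongside a function-level signal. All names, dimensions, and the linear scorer here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: a 6-token function whose tokens belong to 3 statements.
token_emb = rng.normal(size=(6, 4))      # 6 tokens, embedding dim 4 (toy sizes)
stmt_ids = np.array([0, 0, 0, 1, 1, 2])  # statement (line) id of each token

def pool_statements(token_emb, stmt_ids):
    """Mean-pool token embeddings into one vector per statement."""
    n_stmts = int(stmt_ids.max()) + 1
    return np.stack([token_emb[stmt_ids == s].mean(axis=0)
                     for s in range(n_stmts)])

stmt_emb = pool_statements(token_emb, stmt_ids)  # shape (3, 4)

# A hypothetical linear scorer over statement vectors gives per-line scores;
# sorting them supports top-k line-level detection, and aggregating them
# (e.g. taking the max) could feed a function-level decision.
w = rng.normal(size=4)
line_scores = stmt_emb @ w
ranked_lines = np.argsort(-line_scores)  # most suspicious statements first
function_score = line_scores.max()
```

In a real model the pooling and scoring would be learned jointly (e.g. attention-weighted pooling and a trained classifier head) rather than fixed mean-pooling with random weights, but the token-to-statement aggregation step is the same shape.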