Accurate, Generalizable, and Practical Behavioral Models to Identify Impending User Exposure to Malicious Websites
Abstract
To keep users safe online, current protections frequently employ blocklists of known malware and phishing websites. However, such defenses suffer from an inherent gap between the creation of malicious content and its detection, a window during which users remain vulnerable. To address this limitation, earlier research has shown that individual users' web browsing behavior can be used to identify imminent exposure to malicious content. While existing methods frequently rely on temporal proximity (e.g., aggregating browsing patterns over the recent past), they do not leverage the temporal ordering of user browsing, which results in suboptimal performance that is, in practice, inadequate given the low base rates of malware incidence. We introduce network- and browser-level features (e.g., page rank, tab browsing time) and a temporal model that captures user behavior through a time-series representation. This approach not only improves classification performance by a significant margin over previous models (F1-score improvements of between 93% and 145%), but also remains robust across completely disparate sets of users. More importantly, our method shows strong resilience to concept drift, as performance holds steady over multiple years of testing. We discuss how this method is capable of anticipating future exposure. We also assess the relative importance of each feature to performance, as well as its impact on false positive rates, whose minimization is critical to foster adoption. Finally, we discuss use cases for such behavior-based models.