2019-11-26: Summary of "Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub"


Figure 1: The Life-Cycle of a Vulnerability (Source: Horawalavithana)
Cyber security attacks can be enabled by the fact that many widely used applications share open-source libraries. As a result, a vulnerability or software weakness in one of these libraries can have far-reaching impact. Once a vulnerability is discovered, security experts may announce it on a variety of forums, blogs, and social media sites. Cyber-adversaries might also explore these public information channels and private discussion threads on the dark web to identify potential attack targets and ways to exploit them.

In their 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI '19) paper, "Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub", Sameera Horawalavithana, Abhishek Bhattacharjee, Renhao Liu, Nazim Choudhury, Lawrence O. Hall, and Adriana Iamnitchi present a quantitative analysis of user-generated content related to security vulnerabilities on three digital platforms: two social media conversation channels (Reddit and Twitter) and a collaborative software development platform (GitHub). Their analysis shows that while more security vulnerabilities are discussed on Twitter, relevant conversations go viral earlier on Reddit. They also show that the two social media platforms can be used to accurately predict activity on GitHub.


Dataset
The authors investigated security vulnerabilities and their mentions in social media over a period of 18 months using a private dataset released by DARPA as part of its 2018 Computational Simulation of Online Social Behavior (SocialSim) challenge. First, Horawalavithana et al. selected Common Vulnerabilities and Exposures (CVE) identifiers published in the National Vulnerability Database (NVD) between January 2015 and May 2018. Second, the authors filtered posts shown on Reddit and Twitter between March 2016 and August 2017 to identify posts that mentioned a CVE ID. Last, they selected repositories from GitHub that were related to the CVE IDs already identified. The longer time frame for the NVD was chosen to allow comparison between the timing of conversations on Reddit and Twitter and the public disclosure of vulnerabilities in the NVD. The use case and supplemental processing, if any, for each dataset are described below.

  • The NVD serves as the standard repository for publicly disclosed security vulnerabilities. The NVD fields of interest include the published date, the CVSS score (0-10), and the attack severity (i.e., critical, high, medium, low). In Figure 1, Horawalavithana et al. describe the life cycle of a vulnerability using three defined phases adapted from related research.
    • Black Risk Phase is the period between initial identification and public disclosure of the vulnerability. The exploitation risk is highest during this phase and discussions might be observed among small communities on public and private forums.
    • Grey Risk Phase occurs when the NVD accepts the vulnerability and continues until the vendor releases an official patch or countermeasure.
    • White Risk Phase covers the time needed to deploy the countermeasure.
  • The Social Media Datasets from Reddit and Twitter consist of conversations and tweets that include at least one CVE ID. The authors used the regular expression pattern CVE-\d{4}-\d{4}\d* to match CVE IDs that appeared in posts, comments, and tweets. In addition, the dataset was augmented to include re-tweet cascades, sentiment analysis (polarity and subjectivity), and bot detection. Bot-driven messages were removed in favor of human responses. Table 1 provides a comparison of the social media platforms in terms of the volume of security vulnerability mentions.
Table 1: Size of dataset. Activities represent posts (18%) and comments (82%) in Reddit, tweets (76%), re-tweets (19%) and replies (5%) in Twitter, and events (push, issue, pull-request, watch, fork) in GitHub. Communities are represented by subreddits in Reddit, hashtags in Twitter, and software repositories in GitHub. (Source: Horawalavithana)
  • The GitHub Dataset focused on public repositories that have one of the CVE IDs from the NVD dataset in the repository text description or a Git commit message. The same regular expression was used to match CVE IDs in the text descriptions. The authors noted a significant overlap between the NVD and the social media datasets, with a 47% and 3% overlap in the CVE IDs observed on Twitter and Reddit, respectively.
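
The CVE matching step described above can be sketched in a few lines of Python. This is only an illustration built around the pattern quoted in the summary; the function name `extract_cve_ids` and the sample text are made up for this example, and whether the authors matched case-insensitively or deduplicated IDs per message is not stated in the summary.

```python
import re

# Pattern quoted in the summary: "CVE-", a four-digit year, then four or more digits.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4}\d*")

def extract_cve_ids(text):
    """Return the distinct CVE IDs found in text, in order of first appearance."""
    seen = []
    for match in CVE_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

# Hypothetical message text, used only to exercise the pattern.
sample_post = "Patch now: CVE-2015-1805 is being exploited; see also CVE-2017-1000117."
print(extract_cve_ids(sample_post))  # ['CVE-2015-1805', 'CVE-2017-1000117']
```

As written, the pattern is case-sensitive, so a lowercase "cve-2015-1805" would not match unless the `re.IGNORECASE` flag were added.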

Data Analysis
Previous work in the area of online cyber security discussions suggests that messages shared on social media platforms can be used as early signals to detect security vulnerabilities. With this knowledge, Horawalavithana et al. analyzed the reaction of each social media platform. Further, they attempted to show how information disseminated on Reddit and Twitter relates to or compels software development activities in GitHub repositories. Core questions of interest addressed by the authors include:

  1. How do social media platforms compare in terms of these signals on security vulnerabilities?
  2. To what extent are named vulnerabilities discussed on public channels before the official disclosure day?

CVE Mentions in Reddit and Twitter
The authors characterized the social media platforms based on the appearance of CVE IDs. They analyzed the CVE IDs discussed only on Twitter, only on Reddit, and on both platforms.  As shown in Figure 2, 10,257 CVE IDs were mentioned in the Reddit-Twitter dataset. 95% of CVE IDs were mentioned only on Twitter. 0.5% of the CVE IDs were mentioned only on Reddit. 4.5% were mentioned on both platforms.
Figure 2: More security vulnerabilities are discussed on Twitter. (Source: Horawalavithana)
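
The three-way split into Twitter-only, Reddit-only, and shared CVE IDs is a plain set computation. The sketch below uses a handful of made-up IDs, not the paper's data, and the function name `platform_breakdown` is an assumption introduced for illustration.

```python
# Toy sets of CVE IDs per platform (made-up values for illustration only).
twitter_cves = {"CVE-2016-0001", "CVE-2016-0002", "CVE-2017-0003", "CVE-2017-0004"}
reddit_cves = {"CVE-2017-0004", "CVE-2017-0005"}

def platform_breakdown(twitter, reddit):
    """Return (count, share) of IDs seen only on Twitter, only on Reddit, and on both."""
    groups = {
        "twitter_only": twitter - reddit,
        "reddit_only": reddit - twitter,
        "both": twitter & reddit,
    }
    total = len(twitter | reddit)  # every ID mentioned on at least one platform
    return {
        name: (len(ids), len(ids) / total if total else 0.0)
        for name, ids in groups.items()
    }

breakdown = platform_breakdown(twitter_cves, reddit_cves)
for name, (count, share) in breakdown.items():
    print(f"{name}: {count} ({share:.0%})")
```

The shares are taken over the union of both platforms, so the three groups always sum to 100%.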
The timing of mentions relative to public disclosure of the vulnerability was used to describe early signals. Of the 10,209 CVE IDs discussed on Twitter, 17% were mentioned before their public disclosure. During the same time frame, of the 460 CVE IDs discussed on Reddit, 51% were mentioned in advance of public disclosure. Figure 3 shows the daily volume of posts and tweets relative to Day 0, which represents the NVD public disclosure date. The published date of each message (post or tweet) is expressed relative to the NVD public disclosure date of the mentioned CVE ID. Horawalavithana et al. observed that both Reddit and Twitter have mentions of CVE IDs more than a year prior to public disclosure. They also observed a spike in CVE mentions around Day 0 on both platforms.
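
One way to express the Day 0 alignment and the share of early mentions is sketched below. The records are hypothetical, and the rule that a CVE ID counts as early if at least one mention precedes its NVD published date is an assumption; the summary does not spell out the authors' exact criterion.

```python
from datetime import date

# Hypothetical records: (cve_id, date of the post or tweet, NVD published date).
mentions = [
    ("CVE-2017-0001", date(2017, 3, 1), date(2017, 3, 10)),   # 9 days before Day 0
    ("CVE-2017-0001", date(2017, 3, 12), date(2017, 3, 10)),  # 2 days after Day 0
    ("CVE-2017-0002", date(2017, 5, 5), date(2017, 5, 5)),    # on Day 0
]

def relative_day(mention_date, disclosure_date):
    """Day offset of a mention from NVD disclosure; negative means before Day 0."""
    return (mention_date - disclosure_date).days

def share_mentioned_early(records):
    """Fraction of distinct CVE IDs with at least one mention before disclosure."""
    all_ids = {cve for cve, _, _ in records}
    early_ids = {cve for cve, m, d in records if relative_day(m, d) < 0}
    return len(early_ids) / len(all_ids) if all_ids else 0.0

offsets = [relative_day(m, d) for _, m, d in mentions]
print(offsets)                          # [-9, 2, 0]
print(share_mentioned_early(mentions))  # 0.5
```

Counting the offsets per day would give the kind of daily volume curve around Day 0 that Figure 3 describes.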

Figure 3: Both platforms show a peak in the mentions of CVE IDs near their public disclosure (Source: Horawalavithana)
Discussions on Reddit and Twitter were classified by topics suggested by subreddits for Reddit and hashtags for Twitter. As noted in Tables 3 and 4, the majority of CVE IDs found on both Reddit and Twitter were discussed before public disclosure.

Table 3: Top 10 subreddits by the total number of posts published.  (Source: Horawalavithana)
Table 4: Top 10 hashtags by the total number of tweets published. (Source: Horawalavithana)
CVE Mentions in GitHub Actions (Software Development)
Next, the authors considered how GitHub activity typically follows the public disclosure of security vulnerabilities. There were 10,502 distinct CVE IDs included in the text descriptions of GitHub events. As shown in Table 5, most CVE IDs appear in commit messages. While the majority of CVE IDs are mentioned in only one GitHub event, some vulnerabilities are mentioned in multiple repositories. The pattern of observed GitHub activity over time is shown in Figure 4, based on a vulnerability associated with the Linux kernel. Horawalavithana et al. surmised that spikes in activity, sometimes months after public disclosure, are likely due to the software development life cycle, in which vulnerabilities are not addressed until after a major exploit. Using dynamic time warping (dtw), the authors attempted to determine similarities between time series of events. As expected, push events were popular since this is the mechanism used to contribute to a repository. On the other hand, the authors also observed similarities between fork and watch activities, which are measures of popularity (dtw is 323 and 263, respectively). Their analysis suggests that only certain types of GitHub activity are influenced by the volume of CVE mentions, which they attribute to interest in learning about bug fixes or developing exploit code.
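
For readers unfamiliar with dynamic time warping, the following is a minimal sketch of the classic dynamic-programming formulation with absolute difference as the local cost. The two series are invented, and the choice of local cost and the absence of a warping window are assumptions; the summary does not say which settings or which series pairs the authors used.

```python
def dtw_distance(series_a, series_b):
    """Classic O(n*m) dynamic time warping distance with |a - b| as the local cost."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of the first i and first j points.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # repeat a point of series_b
                cost[i][j - 1],      # repeat a point of series_a
                cost[i - 1][j - 1],  # advance both series
            )
    return cost[n][m]

# Hypothetical daily counts: the second series lags the first by one day.
cve_mentions = [0, 1, 5, 2, 0]
fork_events = [0, 0, 1, 5, 2]
print(dtw_distance(cve_mentions, fork_events))  # 2.0
```

A pointwise comparison of these two series gives a total absolute difference of 10, while the warped alignment costs only 2, which is why dtw suits activity curves that have a similar shape but are shifted in time. Lower values mean more similar series.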

Table 5:  Distribution of distinct CVE IDs as they appeared in GitHub event texts. (Source: Horawalavithana)
Figure 4:  The distribution of GitHub events associated with CVE IDs. The insert presents the number of GitHub events over time that are related to CVE-2015-1805, a vulnerability in the Linux kernel. (Source: Horawalavithana)

Predicting GitHub Activities
Finally, Horawalavithana et al. investigated whether Twitter and Reddit CVE mentions help predict the actual activity on GitHub repositories. Activity in this area strengthens the case that online social media platforms create an ecosystem in which signals travel across platforms. Predicting GitHub activities may be important because:

  • GitHub hosts many exploits and patches related to CVE IDs.
  • Predictions might reflect the software development activities of an attacker who develops an exploit.
  • Predictions can be used to estimate the availability of a patch related to a security vulnerability.

Machine Learning
The authors trained two machine-learning models to predict GitHub events. A GitHub event consists of the type of action (as listed in Table 5), the associated GitHub repository, the identity of the user who performed the action, and the event timestamp. The model's features include the daily count of posts, daily count of active authors, daily count of active subreddits, and daily count of comments on Reddit; and the daily count of tweets, daily count of tweeting users, daily count of retweets, and daily count of retweeting users on Twitter. Using a much larger feature value vector, the authors trained a recurrent neural network for each GitHub event type to predict the likelihood of a user action on a given GitHub repository in a particular hour. Expanded features used for prediction include daily counts of posts, active authors, active subreddits, comments, tweets, and retweets. Horawalavithana et al. used GitHub data from January to May 2017 as training data, the following two months as validation data, and the month of August as test data.
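
The daily-count features and the chronological split described above might be assembled roughly as follows. This is a simplified sketch, not the authors' pipeline: the record layout, the `KINDS` tuple, and both function names are hypothetical, and the example pools authors across platforms instead of tracking Reddit authors, subreddits, tweeting users, and retweeting users separately.

```python
from collections import defaultdict
from datetime import date

# Hypothetical social-media records: (day, kind, author).
records = [
    (date(2017, 5, 30), "post", "alice"),
    (date(2017, 5, 30), "comment", "bob"),
    (date(2017, 5, 30), "tweet", "carol"),
    (date(2017, 6, 2), "tweet", "carol"),
    (date(2017, 6, 2), "retweet", "dave"),
    (date(2017, 8, 1), "retweet", "erin"),
]

KINDS = ("post", "comment", "tweet", "retweet")

def daily_features(rows):
    """Per day: one count per record kind, followed by the number of distinct authors."""
    counts = defaultdict(lambda: dict.fromkeys(KINDS, 0))
    authors = defaultdict(set)
    for day, kind, author in rows:
        counts[day][kind] += 1
        authors[day].add(author)
    return {
        day: [counts[day][k] for k in KINDS] + [len(authors[day])]
        for day in counts
    }

def assign_split(day):
    """Chronological split for 2017: Jan-May train, Jun-Jul validation, Aug test."""
    if day.year != 2017 or day.month > 8:
        return None  # outside the window described in the summary
    if day.month <= 5:
        return "train"
    if day.month <= 7:
        return "validation"
    return "test"

features = daily_features(records)
splits = {day: assign_split(day) for day in sorted(features)}
for day in sorted(features):
    print(day, splits[day], features[day])
```

Splitting by date instead of at random keeps later observations out of the training data, which matters when the goal is to predict future activity.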

Prediction Results
Since these events measure the popularity of a GitHub repository, the authors reported prediction results only for fork and watch events in GitHub. The distributions of forks and watches are presented in Figures 5a and 5b. The authors measured the similarity between ground truth and their simulated predictions of GitHub events using Jensen-Shannon (js) divergence, a statistical measure that quantifies the difference between two probability distributions on a scale from 0 (indistinguishable) to 1. Horawalavithana et al. determined that their predicted distributions of fork and watch events were nearly equivalent to ground truth, with low js divergence scores of 0.0029 and 0.0020, respectively. Further, the coefficient of determination (R squared), which measures the goodness of fit of a model, was 0.6300 and 0.6067 for the predicted events, where 1 is considered a perfect fit. Finally, the authors examined their predictions as a time series. Figure 6 tracks the growth of the most active GitHub repository in August 2017. Horawalavithana et al. concluded that their simulations tracked the ground truth accurately for the first week but observed limitations in the predictive power of their model over longer intervals.
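
Both evaluation measures are short enough to sketch directly. The code below assumes base-2 logarithms, which bound the Jensen-Shannon divergence between 0 and 1 as described above, and it assumes the divergence itself is reported, not its square root (the Jensen-Shannon distance); the summary does not say which variant the authors computed. The counts are invented.

```python
import math

def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical fork counts per repository: ground truth versus simulation.
truth_counts = [50, 30, 15, 5]
predicted_counts = [48, 31, 16, 5]

print(js_divergence(normalize(truth_counts), normalize(predicted_counts)))
print(r_squared(truth_counts, predicted_counts))
```

Identical distributions give a divergence of 0 and distributions with no overlap give 1, so scores such as 0.0029 and 0.0020 indicate predicted distributions very close to the observed ones.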


Figure 5: GitHub Popularity: the distribution of a) Fork events and b) Watch events across GitHub repositories. (Source: Horawalavithana)
Figure 6: The growth of the most active GitHub repository by the number of daily events occurring in August 2017. (Source: Horawalavithana)

Summary Discussion
This paper compares the volume and timing of security vulnerability mentions on three social platforms over a period of 18 months, from March 2016 to August 2017. In addition, Horawalavithana et al. present machine-learning models that predict the patterns of popularity and engagement level activities in GitHub using information gleaned from Reddit and Twitter. The authors theorized and concluded that diverse online platforms are interconnected such that the activities in one platform can be predicted based on the activities in others. Their conclusions were based on the following observations:


  • The volume of security vulnerability mentions is significantly higher on Twitter than on Reddit, and the mentions appear slightly earlier. This suggests Twitter is a better platform to monitor for early vulnerability alerts.
  • Most vulnerability mentions on Reddit occur before public disclosure. Deeper levels of discussion among professional communities were also noted on this platform.
  • The majority of GitHub activity occurs after public disclosure of a vulnerability. Signals from Reddit and Twitter may be useful for predicting events in repositories which mention vulnerabilities. Here, Horawalavithana et al. stress they are not suggesting that activity on Twitter and Reddit directly affects or drives activity observed on GitHub.

The findings of Horawalavithana et al. could be practically applied to:

  • Advance or calibrate security alert tools based on information from multiple social media platforms.
  • Coordinate software development activities with the lessons learned from social-media information.


-- Corren McCoy (@correnmccoy)

Horawalavithana, S., Bhattacharjee, A., Liu, R., Choudhury, N., Hall, L. O., & Iamnitchi, A. (2019, October). Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub. In IEEE/WIC/ACM International Conference on Web Intelligence (pp. 200-207). ACM. doi: 10.1145/3350546.3352519
