We share a number of software tools and datasets with the research community. The items listed below have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Please contact me if the links become unreachable.

Reproducible Memory Corruption Vulnerabilities

To study the reproducibility of crowd-reported security vulnerabilities, we collected and analyzed 368 memory corruption vulnerabilities discovered from 2001 to 2017. To facilitate future research, we share our full dataset with the research community. The dataset includes 291 vulnerabilities with CVE-IDs and 77 vulnerabilities without CVE-IDs. For each vulnerability, we have filled in the missing pieces of information, annotated the issues we encountered during reproduction, and created the appropriate Dockerfiles. Each vulnerability report contains structured information fields (in HTML and JSON), detailed instructions on how to reproduce the vulnerability, and fully tested PoC exploits. The repository also includes pre-configured virtual machines with the appropriate environments. To the best of our knowledge, this is the largest public ground-truth dataset of real-world vulnerabilities that have been manually reproduced and verified. The dataset is available at the [Download Link]. The related paper is: [USENIX Security'18].

Password Datasets

To study the security threat posed by passwords leaked in data breaches, we collected 107 password datasets leaked between 2008 and 2016 (e.g., LinkedIn, MySpace, Adobe, Ashley Madison). We linked users across different datasets to study password reuse and modification patterns. In total, the dataset covers 28.8 million users and their 61.5 million passwords over 8 years. Please find more details at the [Project Website]. The related paper is: [CODASPY'18].

Social Livestream Datasets

This dataset was collected from a social livestreaming service (Twitter's Periscope) in 2015. It contains 13,894,852 broadcasts with a total of 416,207,256 comments, 6,101,042,415 hearts, and other detailed interaction metadata. Please check out the [Project Website] for the dataset details. The related paper is: [IMC'16].

Clickstream User Behavior Model

In this project, we build an unsupervised system to capture dominating user behaviors from clickstream data (traces of users' click events), and visualize the detected behaviors in an intuitive manner. The system identifies "clusters" of similar users by partitioning a similarity graph (nodes are users; edges are weighted by clickstream similarity). The partitioning process leverages iterative feature pruning to capture the natural hierarchy within user clusters and produce intuitive features for visualizing and understanding captured user behaviors. The code and sample data are available at the [Project Website]. Related papers are: [CHI'16], [TWEB'17], [USENIX Security'13].
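The similarity-graph idea described above can be illustrated with a minimal sketch. This is not the project's code: the bigram-count features, the cosine similarity measure, the 0.5 threshold, and all function names and sample users are assumptions made for illustration. The partitioning step is deliberately replaced by a much simpler stand-in (drop weak edges, then take connected components), and the iterative feature pruning is not implemented.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def ngram_counts(events, n=2):
    """Count the n-grams (runs of n consecutive events) in one clickstream."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_graph(clickstreams, n=2):
    """Nodes are users; return {(u, v): weight} for pairs with non-zero similarity."""
    vectors = {user: ngram_counts(events, n) for user, events in clickstreams.items()}
    edges = {}
    for u, v in combinations(sorted(vectors), 2):
        weight = cosine(vectors[u], vectors[v])
        if weight > 0:
            edges[(u, v)] = weight
    return edges


def partition(users, edges, threshold=0.5):
    """Simplified stand-in for graph partitioning: keep edges at or above
    the threshold and return the connected components (union-find)."""
    parent = {u: u for u in users}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (u, v), weight in edges.items():
        if weight >= threshold:
            parent[find(u)] = find(v)

    clusters = {}
    for u in users:
        clusters.setdefault(find(u), set()).add(u)
    return sorted(clusters.values(), key=sorted)


# Hypothetical sample data: each user maps to an ordered list of click events.
clickstreams = {
    "alice": ["login", "browse", "browse", "buy"],
    "bob": ["login", "browse", "buy"],
    "carol": ["login", "message", "message", "logout"],
    "dave": ["login", "message", "logout"],
}

edges = similarity_graph(clickstreams)
clusters = partition(clickstreams.keys(), edges)
print(clusters)
```

On this toy input the shoppers (alice, bob) and the messagers (carol, dave) end up in separate clusters because the two groups share no bigrams. A real deployment would need a proper graph-partitioning method and the hierarchical pruning step described above.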

10 Location De-anonymization Algorithms

To evaluate the performance of de-anonymization algorithms on real-world datasets, we re-implemented 7 de-anonymization algorithms published in the last 10 years: POIS [WWW'16], ME [AIHC'16], HIST [TIFS'16], WYCI [WOSN'14], MSQ [TON'13], HMM [IEEE SP'11], and NFLX [IEEE SP'08]. We also added 3 new algorithms to this collection. The code and sample data are available at [Github]. The related paper is: [NDSS'18].