One of the challenges with learning or performing data analysis (statistical analysis, visualizations, and machine learning) is the "getting the data" part. Where can data be obtained? Is it open? Does it contain known (labeled) samples? Getting data that can be used across analyses seems to be a particular problem within the Information Security industry, owing to the paranoia and tight-lipped attitude around sharing data, especially data that contains any kind of malware or breach information. However, there are lots of people willing to share at least some information.
A couple of security-related datasets have become the de facto standard for academic security research: the KDD Cup datasets for 1999 and 2010. But if you don’t want network or web classification data, what do you do? Several sites offer specific types of data compiled from various sources. Some great examples of this are:
While it’s fantastic that people are producing data, that data is hard to find if you don’t know where to start looking.
Enter SecRepo, a site that attempts to connect people with the data they want for the use cases they’d like to explore. It is a central repository that is open to hosting security-related data and that also points out the great resources already available. Our long-term vision involves creating open data, connecting researchers with other datasets, and even hosting analyses that leverage the created or linked data.
We’re open to any contributions, from data you’d like hosted to datasets you’ve found interesting and useful.