MLSec Project MLSec Project is a select community of like-minded individuals that want to work together on using machine learning and data science in information security. It features open source projects, blog posts and community content on this subject. STIX, Stones, and Standardization Woes Mon, 15 Jun 2015 15:00:00 -0700 <img src=//,f_auto,h_2000,q_80,w_1200/v1/355914/xkcd-standards_lgutxq.png></img><br><div>It is with great excitement that I have become a voting member of the new STIX/TAXII/CybOX standard technical committee! The standards are now being transitioned from under DHS's and MITRE's wings to a broader community under OASIS so that they can become <strong>actual</strong> standards, and benefit from decades of experience from a great number of professionals from outside the USG. The MLSec Project was invited in recognition of the work we have been doing to analyze and demystify Threat Intelligence feeds.</div><p>I have privately made my fair share of jokes at the expense of these proto-standards, such as the ever popular <strong>"Why did it have to be XML?"</strong>, and the revealing <strong>"I just wanted an IP address, not all this weird structure"</strong>. I am also sure some of my closest friends who are very influential and knowledgeable in the DFIR field will cast disapproving side-glances.</p><p>The fact is that, just like with the usage of the word "cyber", no amount of eye rolling and snark will keep those standards from being implemented in a great number of security products. And, as many of you will agree, I'd rather have a good standard that will be actually useful than the mish-mash of custom data feeds we have today. 
Even worse would be a <strong>bad</strong> standard, one that everyone is "forced" to implement, but then creates hidden workarounds for internally to make it work with whatever they are doing.</p><h3><div>Trust me, I am a (cyber data) engineer</div></h3><p>However, this is not a defense of the current proto-standards or what they bring to the table now. There are at least 3 things that I have heard and experienced that I believe should be addressed on the technical committee. As we are still spinning up the standard making machine there at OASIS, I'd rather present this to you first:</p><p>1) <strong>"The...<a href=>Read More</a> The MLSec Project guide to FIRST Conference 2015 Mon, 15 Jun 2015 06:26:28 -0700 <div>We at MLSec Project are very excited to be presenting at FIRST for the first time (see what I did there)! I am personally even more psyched that this time I will be co-presenting with my colleague Alexandre Sieira, who has grown to be a committer and principal of MLSec Project with several contributions on our "not ready for open source" experiments at the moment. Did I mention that he also works for a <a href="" target="_blank">great startup</a>?</div>But enough about us. I have reviewed the talks list for FIRST 2015 and here are the ones you should keep an eye out for if you are interested in data science and machine learning applied to security. Let me know on Twitter if I missed something!<h3><div>Data Munging and Architecture</div></h3><p><a href="" target="_blank">Building instantly exploitable protection for yourself and your partners against targeted cyber threats using MISP</a> - Mr. Andras IKLODY (CIRCL) - June 15th, 2015 11:00 – 12:00</p> <p>Should be a good introductory overview of MISP, which is an open-source threat intelligence platform of sorts. 
I have not been keeping up with its development, so it should be worthwhile to see how it is going.</p><p><a href="" target="_blank">Ce1sus: A Contribution to an Improved Cyber Threat Intelligence Handling</a> - Mr. Jean-Paul WEBER ( - June 16th, 2015 12:45 – 13:15</p> <div>This is yet another open-source Threat Intelligence Platform, but if it actually has the STIX implementation it promises, it should be worthwhile to have a look. Regardless, there is so much that can be done in this space, and it should be interesting to see what this one does differently.</div><p><a...<a href=>Read More</a> New Website and Blog for MLSec Project Fri, 08 May 2015 00:00:00 -0700 <div>It has been almost 2 years since the MLSec Project started and it has been an incredible journey from those simple beginnings, asking companies for data to experiment on, to the community of over 40 researchers who collaborate and discuss the advancements of machine learning and data science in information security.</div><div>And throughout this evolution, the original MLSec Project website stood still, reflecting the old "data gathering" aspects of the project and not representing our open-source projects and growing community. Sure, you could get the information you needed from the blog, but it was necessary to put all this information front and center on our website, so it could be more easily accessed by all. </div><div>So it is with great pride that we release the new version of this site. This will be the home of our open-source development and community, and you will also be able to find all our presentations linked on the main page with ease.</div><div>The data gathering and daily reports that were offered on MLSec Project are now a part of the SaaS offering from <a href="" target="_blank">Niddel</a>. 
If you want to experience 2 years of research applying machine learning to threat intelligence indicator data, reach out to them to set up a trial!</div><a href=>Read More</a> RSA Conference and BSides SF Talk Roundup Tue, 07 Apr 2015 15:00:00 -0700 <p>It's RSA Conference time again, and there are some interesting talks going on this year at Moscone Center and at BSides SF. These are the talks that I and the general MLSec Project community will be attending and commenting on during this very busy week in San Francisco.</p> <p>If you have not yet registered, you can buy (very expensive!) tickets to RSA Conference <a href="" target="_blank">here</a>. Arrive early at BSides SF on Sunday if you want to get in, because pre-registration is already closed.</p> <p>Without further ado, here are our picks categorized by areas of interest. Please note we are purposefully excluding Panels and Sponsored Talks. I will probably attend some Panels so I can ask some tough questions, but I'd rather not give anyone the heads-up. :)</p> <p>Anything we missed? Let me know on Twitter!</p><h3><div>Machine Learning, Math and Friends</div></h3><p><strong><a href="" target="_blank">DNS Spikes, Strikes, and The Like</a></strong> - Thomas Mathew (OpenDNS) - BSides SF - Sunday, April 19 • 12:00pm - 1:00pm</p> <p>I have not met Thomas yet, but the OpenDNS Labs folks usually put interesting research out on the data that they have (and it is a lot of data). At first glance, it looks like an intro piece on unsupervised learning and clustering for DNS data. 
However, the mention of "it's not all as it seems on spikes" will probably generate an anecdote or two that should inspire feature building for DNS classifiers.</p> <p><strong><a href="" target="_blank">No More Fudge Factors and Made-up Shit: Performance Numbers That Mean Something</a></strong> - <a href="" target="_blank">Russell Thomas</a> (Large Financial Firm) - BSides SF - Sunday, April 19 • 12:00pm - 1:00pm</p> <p>Russell has been doing a lot of interesting work in...<a href=>Read More</a> The Importance of Good Labels in Security Datasets Thu, 05 Mar 2015 00:00:00 -0800 <div><span>As security researchers, we commonly create a new machine learning algorithm that we want to evaluate. It may be that we are trying to detect malware, identify attacks or analyze IDS logs, but at some point we figure out that we need a </span><strong>good dataset</strong><span> to complete our task. But not just any dataset; in fact we need </span><strong>a labeled dataset</strong><span>. The dataset will be used not only to learn the features of, for example, malware traffic, but also to verify how good our algorithm is. Since getting a dataset is difficult and time consuming, the most common solution is to get a third-party dataset, although some researchers with time and resources may create their own. Either way, we most commonly obtain a dataset of malware traffic (continuing with the malware traffic detection example) and we assign the label </span><strong>Malware</strong><span> to all of its instances. This looks good, so we run our training and testing, we obtain results and we publish. However, there are important </span><strong>problems in this approach</strong><span> that can </span><strong>jeopardize the results</strong><span> of our algorithm and the verification process. Let's analyze each problem in turn.</span></div><h3><div>The need for Normal Labels</div></h3> <p>The first problem is the use of an insufficient number of types of labels. 
Assigning only _malware_ labels is good enough for looking at candidate features, maybe running some tests, and computing <strong>True Positives</strong> and <strong>False Negatives</strong>. However, without <strong>normal</strong> labels we miss a serious amount of information and we lose the possibility to compute most error values because we can <strong>_not_</strong> compute <strong>False Positives</strong> or <strong>True Negatives</strong>. In consequence, we cannot compute any error metric such as True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, F-Measure, Error...<a href=>Read More</a> On Explainability in Machine Learning Wed, 04 Mar 2015 00:00:00 -0800 <img src=//,f_auto,h_2000,q_80,w_1200/v1/255770/Screen_Shot_2015-03-05_at_12.55.36_PM_tzr0sj.png></img><br><p>A few days ago, Gartner's <a href="" target="_blank">Anton Chuvakin</a> posted an article to his blog called <a href="" target="_blank">Killed by AI Much? A Rise of Non-deterministic Security!</a>. In this post, he (rightly) points out that Machine Learning has gotten to the point where we can produce judgements that cannot be easily explained. As he points out, there are some cases where this is fine (let's see what Netflix thinks I would like to watch tonight). Other situations, though, such as deciding which connections might contain attack traffic, may incur a significantly higher penalty for wrong decisions. His big question is <em>My dear security industry peers, are we OK with that?</em></p><p>I am. Here's why.</p><h3><div>Is ML Deterministic?</div></h3><p>One of Dr. Chuvakin's key points is that ML is nondeterministic (that is, if you repeat it with the same input data, you may not get the same output). In my experience, nondeterminism is rare. True, there is a role for randomness in training models (they don't call it a <strong>Random</strong> Forest for nothing!) 
but once the model is trained and moved into production, it's usually fixed so you can reliably reproduce results (at least until the model is retrained). In other words, that model is now deterministic.</p> <p>Having a deterministic model is a necessary prerequisite to being able to explain the judgements made by that model. After all, if there's any randomness in the judging, it's clearly not something you can explain.</p><h3><div>ML is Probably Not Explainable</div></h3><div>So determinism is a prerequisite for explainable systems, but it's not sufficient by itself. In fact, I believe (as does Dr....<a href=>Read More</a> ShmooCon 2015 Recap Wed, 04 Feb 2015 00:00:00 -0800 <p>While I'm a relative ShmooNoob, ShmooCon has quickly become one of my favorite cons to attend. The quality of the talks is generally pretty high, the community is really engaged, and the con itself is always well organized. This year proved no different, so I'm left to talk about my impressions of some of the talks I was lucky enough to see.</p> <p><em>[[Alex: I was not so lucky as to go, but the ShmooCon team did a great job with their streaming setup, so I was able to catch up with most of them from the comfort of my home. So I feel like I can interject a bit here. ;)]]</em></p> <p>I was impressed with the con's dedication to talks involving Data Analysis, and I sat through six of them. I also caught a few non-Data Analysis related talks, but I'm going to be focusing on those six. Overall, the talks were really good, and each seemed to focus on a specific area of "Data Science" and was presented at a different technical level. Only one stuck out as confusing. However, this post will only deal with the positives of each talk.</p> <p><em>[[Alex: Sconzo is such a polite person. ❤️]]</em></p> <p>Let's dig into <a href="" target="_blank">Cockroach Analysis</a>. Full disclosure, <a href="" target="_blank">Dorsey</a> is a good friend of mine and former colleague. 
In this talk Dorsey examined Java Class and SWF files. A good overview of the file format was provided, and from the information available in the file format he constructed various features. He also had some creative ideas with regard to feature generation: given some sparse features of the files, how can features be engineered to help describe the file? Several classifiers were also created using Random Forest, and then used to examine over 300k files. The classifiers got pretty good results and really demonstrate how you can use these types of exercises to help find...<a href=>Read More</a> Introducing SecRepo - A Security Data Clearing House Fri, 16 Jan 2015 00:00:00 -0800 <img src=//,f_auto,h_2000,q_80,w_1200/v1/255770/clearing-house_wtdwex.jpg></img><br><p>I'm going to start off my first guest blog here at <a href="" target="_blank">MLSec Project</a> by talking about one of the biggest challenges, and a proposed solution, to getting started in Security Data Analysis. Data.</p> <p>One of the challenges with learning or performing data analysis (statistical analysis, visualizations, and machine learning) is the "getting the data" part. Where can data be obtained, is it open, does it contain known (labeled) samples? Getting data that can be used across analyses seems to be a problem within the Information Security industry due to the paranoia and tight-lipped attitude when it comes to sharing data, especially data that contains any kind of malware or breach information. However, there are lots of people willing to share at least some information.</p> <p>There have been a couple of datasets related to security that have become the de-facto standard for academic security related data/research: the KDD Cup for 1999 and 2010. But if you don’t want network or web classification data, what do you do? Several sites offer specific types of data compiled from various sources. 
Some great examples of this are:</p> <ul> <li> <p>NETRESEC and their <a href="" target="_blank">PCAP collection</a>;</p> </li> <li> <p><a href="" target="_blank">Contagio malware dump</a> with their collections of labeled malware and related artifacts;</p> </li> <li> <p><a href="" target="_blank">Digital Corpora</a>, which has some great forensics related data.</p> </li> </ul> <p>While it’s fantastic that people are producing data, it’s hard to find if you don’t know where to start looking.</p> <p>Enter <a href="" target="_blank">SecRepo</a>, a site that attempts to get people...<a href=>Read More</a> Happy Holidays from MLSec Project! Sat, 10 Jan 2015 00:00:00 -0800 <img src=//,f_auto,h_2000,q_80,w_1200/v1/255770/big-data-holidays_nv6xkf.jpg></img><br><p>Now that the holidays are over and everyone is slowly getting back to work (including yours truly), I think it would be a good time to update you on what the MLSec Project has been up to these last few months of 2014.</p> <p>We have been very hard at work since summer, and a lot has been going on under the hood here:</p> <ul> <li> <p>Updates and improvements to our <a href="" target="_blank">open source projects</a>. Both <a href="" target="_blank">Combine and TIQ-test</a> are being used by several individuals and organizations for research and operations.</p> </li> <li> <p>We have created a <a href="" target="_blank">community and a discussion group</a> for open-source contributors of MLSec Project and people working with data science and machine learning on information security.</p> </li> </ul> <p>I have also been delivering updated versions of the talks at several venues around the US (HushCon, SIRACon, and as a guest to some organizations) to help our industry be more familiar with data analysis and machine learning.</p> <p>In other news, I have co-founded <a href="" target="_blank">Niddel</a> to handle product development based on the research we have been doing for the past 2 years. 
We already have tens of private beta customers, and would love to talk to you if you are interested in trying out our machine learning-powered, personalized Threat Intelligence Platform.</p><h3><div>Looking forward</div></h3><p>The project is going to be refocused on our community and open-source initiatives, and everyone is invited to make contributions on blog posts around their research and...<a href=>Read More</a> Talk videos are now available Sun, 19 Oct 2014 00:00:00 -0700 <p>I am happy to share the videos from the talks that MLSec Project presented at DEF CON (and BlackHat USA and BSidesLV) this year. They have been available for some time now on YouTube, and here they are for you to check out.</p> <p>If you haven't seen them yet, it is a good opportunity to catch up with the research we have been conducting on Information Security data for the past few months.</p><h3><div>Measuring the IQ of your Threat Intelligence Feeds (#tiqtest)</div></h3><h3><div>Secure Because Math: A Deep-Dive on ML-Based Monitoring (#SecureBecauseMath)</div></h3><div><span>Get in touch, and let us know what you think of the presentations! Make sure to share them, too.</span></div><a href=>Read More</a>