While I'm a relative ShmooNoob, ShmooCon has quickly become one of my favorite cons to attend. The quality of the talks is generally pretty high, the community is really engaged, and the con itself is always well organized. This year proved no different, so I'm left to talk about my impressions of some of the talks I was lucky enough to see.
[[Alex: I was not so lucky as to go, but the ShmooCon team did a great job with their streaming setup, so I was able to catch up with most of them from the comfort of my home. So I feel like I can interject a bit here. ;)]]
I was impressed with the con's dedication to talks involving Data Analysis, and I sat through six of them. I also caught a few non-Data Analysis related talks, but I'm going to be focusing on those six. Overall, the talks were really good. Each focused on a specific area of "Data Science," and they were presented at varying technical levels. Only one stuck out as confusing. However, this post will only deal with the positives of each talk.
[[Alex: Sconzo is such a polite person. ❤️]]
Let's dig into Cockroach Analysis. Full disclosure, Dorsey is a good friend of mine and former colleague. In this talk Dorsey examined Java Class and SWF files. A good overview of the file formats was provided, and from the information available in each format he constructed various features. He also had some creative ideas with regard to feature generation: given some sparse features of the files, how can additional features be engineered to help describe the file? Several classifiers were also created using Random Forest, and then used to examine over 300k files. The classifiers got pretty good results and really demonstrate how you can use these types of exercises to help find net-new "interesting" files in your environment. My favorite part of the presentation? Code and examples were released.
[[Alex: The Click Security team talks are always good, because they delve into the feature generation and data collection parts with some detail. The feature generation process is usually where the real "magic" happens, and it is good to have people outside of academia discussing their processes and challenges.]]
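To make the feature generation idea a bit more concrete, here is a minimal sketch of pulling a few features out of the headers of Java Class and SWF files. This is not the code released with the talk, and the feature choices (size, byte entropy, and a handful of header fields) are assumptions picked for illustration. The header layouts themselves are the standard ones: a Java class file starts with the magic bytes `0xCAFEBABE` followed by big-endian minor version, major version, and constant pool count, while an SWF file starts with `FWS`, `CWS`, or `ZWS`, a version byte, and a little-endian declared length. The function names are hypothetical.

```python
import math
import struct
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def header_features(data: bytes) -> dict:
    """Build a small feature dict from raw file bytes (illustrative features only)."""
    features = {"size": len(data), "entropy": byte_entropy(data), "kind": "unknown"}

    if data[:4] == b"\xca\xfe\xba\xbe" and len(data) >= 10:
        # Java class header: magic, then three big-endian unsigned 16-bit fields.
        minor, major, cp_count = struct.unpack(">HHH", data[4:10])
        features.update(
            kind="java_class",
            minor_version=minor,
            major_version=major,
            constant_pool_count=cp_count,
        )
    elif data[:3] in (b"FWS", b"CWS", b"ZWS") and len(data) >= 8:
        # SWF header: 3-byte signature, 1-byte version, 4-byte little-endian length.
        (declared_length,) = struct.unpack("<I", data[4:8])
        features.update(
            kind="swf",
            compressed=data[:3] != b"FWS",  # CWS is zlib, ZWS is LZMA
            swf_version=data[3],
            declared_length=declared_length,
        )
    return features
```

Feature dicts like these could then be turned into rows of a feature matrix and handed to a classifier such as a Random Forest, which is the family of model the talk used.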
Next up: Practical Machine Learning. This was a great high-level talk. Terry did a really nice job summing up some things that are possible with machine learning as well as the different applications of supervised vs. unsupervised learning. Two case studies were covered, DGA detection and Command and Control detection, along with how Machine Learning led to successes in those scenarios. A refreshing part of Terry's talk was the coverage of some of the pitfalls and gotchas involved in Machine Learning. No code was released, but the two presentations linked were mentioned (hopefully I linked the correct ones).
[[Alex: I enjoyed this talk in general, mostly because it pointed me to the papers. The "practical" part of the title without presenting code is a bit misleading, but it went through all the important fundamentals over the course of the presentation. I hope for a day when we don't have to explain the differences between supervised and unsupervised learning, but that was not that day.]]
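Since no code came with the talk, here is a rough sketch of what the input side of a DGA detector might look like. These lexical features (length, character entropy, digit ratio, vowel ratio) are commonly used starting points and are assumptions on our part, not the features from Terry's case study. The function names are hypothetical, and the sketch assumes the input is a registered domain like `example.com`, so it only looks at the leftmost label.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def domain_features(domain: str) -> dict:
    """Lexical features for the leftmost label of a domain (illustrative only)."""
    label = domain.lower().strip(".").split(".")[0]
    length = len(label)
    if length == 0:
        return {"length": 0, "entropy": 0.0, "digit_ratio": 0.0, "vowel_ratio": 0.0}
    return {
        "length": length,
        "entropy": shannon_entropy(label),
        "digit_ratio": sum(c.isdigit() for c in label) / length,
        "vowel_ratio": sum(c in VOWELS for c in label) / length,
    }


if __name__ == "__main__":
    for name in ("google.com", "xj4k9q2zpl7v.com"):
        print(name, domain_features(name))
```

With labeled examples of benign and algorithmically generated domains, features like these would feed a supervised classifier. Without labels, they could be clustered, which is the supervised vs. unsupervised split the talk walked through.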
Manually Searching Advisories and Blogs for Threat Data--"Who's Got Time for That?" was awesome and personally disappointing. Both Elvis and Shimon did a fantastic job. They developed a really interesting system that uses Natural Language Processing (NLP) to digest various threat reports and programmatically extract IOCs, context, mitigations, etc. from them. These guys set up a slick system, trained NLP models, used various ways to score what they pulled out from documents, and then skinned the entire thing in what seemed to be a pretty easy-to-use web site. Then came the part that made me sad: no code was released. However, I was super jazzed about them talking about using NLP this way and how you could develop your own solution. We'll call it even for now. *cough* Release the project *cough*
[[Alex: I mostly agree. I had conversations previously with some people (including Sconzo) about why we don't have something like this, and was stoked to see this being talked about. Sadly, no code, so it's back to the drawing board for us. :)]]
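While we wait, here is about the simplest possible starting point for pulling IOCs out of report text. To be clear, this is a plain regular-expression baseline, not the trained NLP models Elvis and Shimon described, and it extracts no context or mitigations. The pattern set, the handling of `[.]` defanging, and the `extract_iocs` name are all assumptions made for this sketch.

```python
import re

# One octet of an IPv4 address: 0-199, 200-249, or 250-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"

IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:" + _OCTET + r"\.){3}" + _OCTET + r"\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def extract_iocs(text: str) -> dict:
    """Return sorted, de-duplicated regex matches per IOC type."""
    # Reports often "defang" indicators, e.g. 192.0.2[.]10; undo that first.
    refanged = text.replace("[.]", ".")
    return {
        name: sorted(set(pattern.findall(refanged)))
        for name, pattern in IOC_PATTERNS.items()
    }


if __name__ == "__main__":
    report = (
        "The dropper (MD5 d41d8cd98f00b204e9800998ecf8427e) "
        "beacons to 192.0.2[.]10 over HTTP."
    )
    print(extract_iocs(report))
```

A baseline like this finds indicators but says nothing about what they mean or whether they are even malicious, which is exactly the gap the scoring and NLP pieces of the talk were aimed at.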
The next talk, White is the New Black, tackles the greatest hurdle when doing any type of Data Analysis and Machine Learning: how do you get good labeled data as input for any models you're building? Irena spoke about a methodology developed at Check Point to help them gather known White (non-malicious) data, as well as how they were able to increase confidence in their Black (malicious) data. It's great to get exposed to others' methodologies. Not only does it afford us an opportunity to learn as a community, but it gives us insight into how much others are following a scientific process.
[[Alex: I loved this talk; it was by far my favorite. This is a VERY REAL problem in the field, and it was great to see the perspective of a different team going through this.]]
Infrastructure Tracking with Passive Monitoring and Active Probing was another end-to-end coverage of the process, from collecting data, through analysis, and on to action. Various methods of data gathering and augmentation were covered as they related to Malware infrastructure. One of my favorite parts about it was that they covered tying file and network data together to build actionable information and output. Anthony and Dhia also presented case studies on a Zbot proxy network as well as GameOver Zeus. One of the more accessible points of this talk was how you could leverage their techniques and code even if you don't have the massive amounts of data that OpenDNS has. Now get out there and get tracking!
[[Alex: Just like the Click Security talks, the OpenDNS talks are usually very interesting. I honestly don't find the 3D graphics stuff they do that helpful or informative (#3Dpewpew), but these data analysis incursions on the DNS usage data they hold are great! I'm a personal friend of Dhia and AK, so I am very biased and will refrain from giving additional commentary.]]
The last talk I saw was The Dark Art of Data Visualization. This presentation covered a couple of different visualization programs that display data in real time (vs. being a static graph). David has been visualizing the network within ShmooCon Labs, and covered his use of glTail and Skyrails. Some code was also released so you can use either of these tools to view Bro log data in a new/different way.
[[Alex: Didn't really catch this one. The code must be worth checking out though.]]