A few days ago, Gartner's Anton Chuvakin posted an article to his blog called *Killed by AI Much? A Rise of Non-deterministic Security!*. In this post, he (rightly) points out that Machine Learning has gotten to the point where we can produce judgements that cannot be easily explained. As he notes, there are some cases where this is fine (let's see what Netflix thinks I would like to watch tonight). Other situations, though, such as deciding which connections might contain attack traffic, carry significantly higher penalties for wrong decisions. His big question is: *My dear security industry peers, are we OK with that?*
I am. Here's why.
One of Dr. Chuvakin's key points is that ML is nondeterministic (that is, if you repeat it with the same input data, you may not get the same output). In my experience, though, nondeterminism is rare in practice. True, there is a role for randomness in training models (they don't call it a Random Forest for nothing!), but once the model is trained and moved into production, it's usually fixed, so the same input reliably produces the same output (at least until the model is retrained). In other words, that model is now deterministic.
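To make that concrete, here's a minimal sketch (using scikit-learn on made-up data; every value here is hypothetical) showing where the randomness actually lives. Training draws bootstrap samples and random feature subsets, but once `fit()` returns, the forest is a fixed set of trees, and scoring the same input twice gives the same answer:

```python
# A minimal sketch with fabricated data: randomness lives in training,
# not in inference against the trained model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))           # stand-in feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in labels

# Training involves randomness (bootstrap samples, feature subsets)...
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# ...but the fitted model is now a fixed set of trees. Scoring the
# same input twice yields the same judgement every time.
sample = X[:5]
assert np.array_equal(model.predict(sample), model.predict(sample))
print(model.predict(sample))  # identical on every call to this model
```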
Having a deterministic model is a necessary prerequisite to being able to explain the judgements made by that model. After all, if there's any randomness in the judging, it's clearly not something you can explain.
Given the very nature of ML and the problems we use it to solve, though, I question whether "explainability" is even a reasonable expectation.
The fact is, we use ML when we operate at scales that would explode human heads ("Big Data" can be dangerous!). Compared to the type of analysis we're used to doing in the security space (network forensics, NSM alert validation, etc.), ML solves fundamentally different problems in a fundamentally different space, precisely because of that scale.
With any reasonably complex system, the model is dealing with so many features and so many data points that the idea of translating such a complicated set of information into something a human can understand in detail is ludicrous. It's akin to trying to draw a hypercube on paper. You can't even draw the third dimension, let alone the fourth (or higher) dimensions.
That's not to say that we can't come close, though. It's up to the vendors to provide documentation about the general logic of their models. For example, it's entirely within the realm of possibility for an analytics product to say something like *Based on an historical baseline of this user's communication patterns with the network, and those with similar job titles in the same location, it is likely that this new communication pattern represents a threat actor using the user's credentials to move through the network.* You can validate this assertion by examining those baselines and the questionable activity (perhaps visualized as a time series graph) without knowing every data point in the baselines or graph.
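As a toy illustration of that kind of validation (the data, threshold, and scenario here are all hypothetical, not any vendor's actual method), the analyst-facing check can reduce to something as simple as comparing today's activity against the user's own history:

```python
# A toy sketch of validating a baseline-deviation assertion.
# All numbers below are fabricated for illustration.
import statistics

# Hypothetical: distinct internal hosts contacted per day, past 30 days
baseline = [4, 5, 3, 4, 6, 5, 4, 4, 5, 3, 4, 5, 6, 4, 5,
            4, 3, 5, 4, 6, 5, 4, 4, 5, 3, 4, 5, 4, 6, 5]
today = 37  # hypothetical: sudden fan-out across the network

mean = statistics.mean(baseline)
stdev = statistics.pstdev(baseline)
z_score = (today - mean) / stdev

# A large deviation from the user's own history is worth an analyst's
# attention, even without inspecting every underlying data point.
if z_score > 3:
    print(f"Anomalous: {today} hosts vs. baseline mean {mean:.1f} "
          f"(z = {z_score:.1f})")
```

The point isn't the arithmetic; it's that the baseline and the outlier are things a human can inspect and reason about, even when the full model behind the alert is not.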