The first problem is using too few types of labels. Assigning only _malware_ labels is enough to look at candidate features, to run some tests and to compute True Positives and False Negatives. However, without normal labels we miss a serious amount of information and we lose the possibility of computing most error metrics, because we can _not_ count False Positives or True Negatives. Consequently, we can not compute the True Negative Rate, the False Positive Rate, the Precision, the Accuracy, the Error Rate or the F-Measure. Without these metrics it is impossible to verify and compare the performance of a given algorithm.
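To make the dependency explicit, here is a minimal sketch (with made-up counts) of the standard metrics written in terms of the four confusion-matrix cells; the comments mark which cells each one needs, showing why missing FP and TN blocks most of them:

```python
# Sketch: which error metrics need which confusion-matrix cells.
# With only malware (positive) labels we can count TP and FN, so TPR and
# FNR are still computable -- but FP and TN stay unknown, blocking the rest.

def metrics(tp, fp, tn, fn):
    """Standard error metrics from the four confusion-matrix cells."""
    return {
        "TPR":       tp / (tp + fn),                   # needs only TP, FN
        "FNR":       fn / (tp + fn),                   # needs only TP, FN
        "TNR":       tn / (tn + fp),                   # needs TN, FP
        "FPR":       fp / (fp + tn),                   # needs TN, FP
        "Precision": tp / (tp + fp),                   # needs FP
        "Accuracy":  (tp + tn) / (tp + fp + tn + fn),  # needs all four
        "ErrorRate": (fp + fn) / (tp + fp + tn + fn),  # needs all four
        "FMeasure":  2 * tp / (2 * tp + fp + fn),      # needs FP
    }

m = metrics(tp=80, fp=10, tn=90, fn=20)  # illustrative counts only
```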
To compute all the error metrics we need labels for both the positive class and the negative class. However, if obtaining a malware dataset is difficult, obtaining a normal dataset is much more difficult. To continue with the malware traffic example, generating a normal dataset is not just a matter of running a non-infected computer for some time. A normal dataset needs normal behaviors, which means real people acting normally. We should think about the normal actions performed in the environment where our algorithm is going to work: a normal computer may be used to check email, use P2P, chat with colleagues, browse the web, etc.
A related issue is that obtaining a normal dataset is more time consuming than obtaining malware data. We can automate the capture of most malware, but we can not automate the capture of normal data. Attempts at automating normal behavior are _not_ useful because human beings can not be easily mimicked. We can not just sit at a computer and type _as_ a normal user, because our behavior will not be exactly normal and we will add a large bias to our dataset. Ideally, we would capture a real user working, but this is difficult because we need to verify that the normal users are really normal. This verification may involve checking that their computers are clean and that the owners are not attacking anyone. In summary, we should get normal data that is representative of our working context and we should make sure that the data is in fact coming from normal sources.
Unfortunately, it is not enough to assign a malware label to all the traffic in the malware experiment and a normal label to all the traffic in the normal experiment. The real origin of the traffic defines the label. Consider the case of normal traffic in the malware traffic detection scenario. We may be tempted to assign the normal label to all the traffic of the IP address of a normal computer. However, the assumption that all the traffic of a normal IP address is _normal_ is false, because most of the time an IP address also receives traffic from a large number of unknown computers in the network. For example, it may receive attacks from computers on the Internet or from the internal network, it may receive broadcast NetBIOS requests from infected computers, it can have its ports scanned, or it can just receive harmless ARP packets from an incorrectly configured router. All this traffic goes to and from the normal IP address, but it is certainly not the result of a normal action. Moreover, the normal computer generates traffic in response to those requests, and that traffic should not be considered normal either, since it was not generated by a normal user activity. So, if this traffic is neither normal nor malware, what is it? It is background traffic, and as such it should be given the Background label.
The same idea must be applied to the malware traffic, because it is common to find assumptions such as: if it was a malware capture, all the traffic is malware. However, a computer infected with the Zeus malware may still be attacked by an external Bobax bot or receive normal multicast packets from the local printer. The issue of the origin of the traffic suggests that it may be a good idea to differentiate the traffic going to an IP address from the traffic coming from it. This is a first approach to separating the normal traffic, the malware traffic and the background traffic.
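The direction-based split above can be sketched as follows. This is a simplified illustration, assuming flows are reduced to `(src_ip, dst_ip)` pairs and that `infected_ip` is the host we deliberately infected; all names and addresses are invented for the example:

```python
# Sketch: a first split of a capture by traffic direction, so that traffic
# merely arriving AT the infected host is not automatically labeled malware.

def split_by_direction(flows, infected_ip):
    """Separate flows originated BY the host from flows merely sent TO it."""
    from_host = [f for f in flows if f[0] == infected_ip]   # candidate Malware
    to_host   = [f for f in flows if f[1] == infected_ip]   # candidate Background
    other     = [f for f in flows if infected_ip not in f]  # unrelated: Background
    return from_host, to_host, other

flows = [
    ("10.0.0.5", "1.2.3.4"),     # infected host connects out
    ("5.6.7.8", "10.0.0.5"),     # inbound scan or attack against it
    ("10.0.0.9", "10.0.0.255"),  # unrelated internal broadcast
]
fh, th, ot = split_by_direction(flows, "10.0.0.5")
```

Only the first group is a reasonable starting point for malware labels; the other two still need inspection before any label is final.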
Our last problem to consider is the balance of the dataset. It is usually the case that we have more data with the positive label than with the negative label, due to the difficulty of obtaining normal data. However, this does not reflect a real environment: in a malware traffic scenario most of the traffic is normal and only a small amount is malicious. If our dataset is not correctly balanced, our classification algorithm may favor the majority class. Although this is related to the base rate fallacy, it also means that a wrongly balanced dataset may result in biased error metrics.
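A small sketch of the base-rate effect makes this concrete: the same detector, with fixed TPR and FPR, yields very different precision depending on the class balance of the data it is evaluated on (the rates below are invented for illustration):

```python
# Sketch: precision of a fixed detector under different class balances.

def precision(tpr, fpr, malware_fraction):
    """Precision = P(malware | alarm) for a given malware base rate."""
    tp = tpr * malware_fraction          # expected true-alarm mass
    fp = fpr * (1 - malware_fraction)    # expected false-alarm mass
    return tp / (tp + fp)

# On a 50/50 dataset the detector looks excellent...
balanced = precision(tpr=0.95, fpr=0.01, malware_fraction=0.5)    # ~0.99

# ...but at a realistic base rate most alarms are false.
realistic = precision(tpr=0.95, fpr=0.01, malware_fraction=0.001)  # ~0.09
```

This is why error metrics measured on a wrongly balanced dataset can be very misleading about real-world performance.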
A wrongly labeled dataset may lead to algorithms training on the wrong features, to an insufficient number of reported metrics, to highly biased detection results, to wrong performance metrics and to a mistaken idea of how good the algorithm is. Correctly labeling our dataset may seem time consuming, but it can teach us a lot about our data and it will give us a strong base for evaluating our algorithm.
To deal with most of these problems it may be a good idea to have a labeling methodology. The methodology should at least include a way of verifying the origin of the data, the IP addresses in the dataset, the malware types, the number of labels and the uniqueness of the labels. Such a methodology may also assist in the semi-automatic assignment of labels if we use, for example, the ralabel tool of the Argus suite, which labels network flows using rules. A methodology like this may help us avoid most of the common issues of labeling a dataset.
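The rule-based idea can be sketched as follows. This is a much simplified analogue in the spirit of such tools, not ralabel's actual configuration syntax: ordered rules are tried in sequence, the first match wins, and anything unmatched falls back to the Background label. All IP addresses and set names are invented for the example:

```python
# Sketch: a semi-automatic, rule-based flow labeler.
# First matching rule wins; unmatched flows default to Background.

NORMAL_IPS   = {"10.0.0.7"}   # hosts verified to be clean (assumption)
INFECTED_IPS = {"10.0.0.5"}   # hosts we deliberately infected

def label_flow(src_ip, dst_ip):
    rules = [
        (lambda s, d: s in INFECTED_IPS, "Malware"),  # traffic FROM an infected host
        (lambda s, d: s in NORMAL_IPS,   "Normal"),   # traffic FROM a verified host
    ]
    for matches, label in rules:
        if matches(src_ip, dst_ip):
            return label
    return "Background"  # everything else: scans, broadcasts, unknown sources

labels = [label_flow(s, d) for s, d in
          [("10.0.0.5", "1.2.3.4"),   # infected host connecting out
           ("10.0.0.7", "8.8.8.8"),   # verified host connecting out
           ("5.6.7.8", "10.0.0.5")]]  # inbound traffic from an unknown source
```

Making Background the default, rather than an afterthought, is the design choice that keeps unknown traffic from silently polluting the normal or malware classes.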