When you are given a set of commands from a log file such as .bash_history or something similar, you can certainly judge whether it reveals a malicious attack on your computer system by reading it line by line, as long as there are not too many commands. Companies, however, have so many log files waiting to be analyzed that auditing them manually is not feasible.
What if the computer system could detect threats by auditing log files automatically? This is where pattern recognition comes into automatic log auditing. It can be regarded as one of the basic steps of machine learning. One of its algorithms is called K-Nearest Neighbors, or KNN.
How KNN works
Let us say that, one day, a man gave me a bottle of juice without telling me what kind of juice it was, and asked me to guess. What did I do? I analyzed the color, smell, and flavor, and then drew the correct conclusion. I think most people would do it the same way. But why does this method work?
Actually, it is an easy question. Color, smell, and flavor are three important properties of juice, so if they are similar to those of a certain kind of juice, you classify your juice as that kind.
Mathematically speaking, let us assume there are values for those three properties (color, smell, flavor).
JuiceA = [0, 0, 0]
JuiceB = [1, 1, 1]
If you know that your juice is yourJuice = [0.1, 0.1, 0], it is much more likely that your juice belongs to JuiceA. The Euclidean distance from your juice to JuiceA is sqrt((0-0.1)^2+(0-0.1)^2+(0-0)^2), about 0.14, which is much smaller than the distance to JuiceB, sqrt((1-0.1)^2+(1-0.1)^2+(1-0)^2), about 1.62. So you believe your juice is JuiceA.
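The distance comparison above can be checked with a few lines of Python. This is only a minimal sketch: the variable names and the euclidean_distance helper are made up for illustration and are not taken from the repository.

```python
import math

# Feature vectors (color, smell, flavor) from the juice example.
juice_a = [0, 0, 0]
juice_b = [1, 1, 1]
your_juice = [0.1, 0.1, 0]


def euclidean_distance(p, q):
    """Return the straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


dist_a = euclidean_distance(your_juice, juice_a)
dist_b = euclidean_distance(your_juice, juice_b)

print(round(dist_a, 3))  # 0.141
print(round(dist_b, 3))  # 1.619
print("JuiceA" if dist_a < dist_b else "JuiceB")  # JuiceA
```

The smaller distance wins, which matches the intuition that the unknown juice is JuiceA.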
This is exactly how KNN works. When you are given a point (x, y) and asked to predict its type, you find its k nearest points and conclude that the given point has the same type as the majority of those k neighbors.
In the figure above, the green point belongs to the same type as the red triangles.
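The voting rule can be sketched in plain Python as follows. This is an illustrative implementation under simple assumptions (Euclidean distance, unweighted majority vote), not the code from the repository. The knn_predict function and the sample points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training_points, training_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = []
    for point, label in zip(training_points, training_labels):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query)))
        distances.append((dist, label))
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among them wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D sample data, made up for illustration.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (3.0, 3.2), (3.1, 2.9)]
labels = ["triangle", "triangle", "triangle", "square", "square"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # triangle
```

An odd k is a common choice for two classes because it avoids tied votes.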
Extract Features
In the juice example above, we extracted the features color, smell, and flavor. Together they form a three-dimensional vector.
If you are extracting features from flowers, you may pay attention to the color, shape, or size of the flowers. If you are extracting features from dogs, you may focus on the color, the size of the body, the shape of the ears and eyes, and so on.
What if you are given a set of system commands and asked to judge whether it is being used for something malicious? Similarly, you can use the count of each command as the features.
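One possible way to turn a command sequence into a count vector is sketched below. The count_features helper, the sample session, and the vocabulary are all made up for illustration. The repository's own _get_feature method may work differently.

```python
from collections import Counter


def count_features(commands, vocabulary):
    """Map a list of commands to a count vector, one slot per vocabulary entry."""
    counts = Counter(commands)
    # Commands that never occur get a count of 0.
    return [counts[name] for name in vocabulary]


# Hypothetical command sequence and vocabulary, made up for illustration.
session = ["ls", "cd", "ls", "cat", "wget", "chmod", "ls"]
vocabulary = ["cat", "cd", "chmod", "ls", "wget"]

print(count_features(session, vocabulary))  # [1, 1, 1, 3, 1]
```

Using one fixed vocabulary for every session keeps all vectors the same length, so they can be compared by distance.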
Code
I have written some code and shared it on GitHub. Here is the location: https://github.com/zhzzhz/MachineLearningWebSec/tree/master/KNN
With the dataset provided by http://www.schonlau.net, I implemented a class that can train, predict, and score the accuracy.
Train
def train(self, file_id):
    """ Do training. """
    feature_list = list()
    commands_list = self._load_commands(file_id)
    label_list = self._load_label(file_id)
    # Build one feature vector per command set.
    for commands in commands_list:
        feature = self._get_feature(commands, self.all_set)
        feature_list.append(feature)
    # Fit the classifier on the feature vectors and their labels.
    self.classifier.fit(feature_list, label_list)
Train the classifier with the command sets and their corresponding labels.
Score
def score(self, file_id):
    feature_list = list()
    commands_list = self._load_commands(file_id)
    label_list = self._load_label(file_id)
    # Build one feature vector per command set.
    for commands in commands_list:
        feature = self._get_feature(commands, self.all_set)
        feature_list.append(feature)
    # Compare the classifier's predictions with the known labels.
    return self.classifier.score(feature_list, label_list)
Score the classifier against another dataset.
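If self.classifier follows the scikit-learn convention (an assumption, since the class setup is not shown here), score returns the mean accuracy: the fraction of predictions that match the true labels. A minimal sketch of that calculation, with a hypothetical accuracy helper and made-up labels:

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty label list")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels, made up for illustration (0 = normal, 1 = attack).
predicted = [0, 0, 1, 0, 1]
actual = [0, 0, 1, 1, 1]

print(accuracy(predicted, actual))  # 0.8
```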
Predict
def predict(self, feature_list=None):
    if feature_list is None:
        # Fall back to the first command set of file 50.
        commands_list = self._load_commands(50)
        feature = self._get_feature(commands_list[0], self.all_set)
        feature_list = [feature]
    return self.classifier.predict(feature_list)
Predict the result for the given feature list. If none is passed, the method falls back to the first command set of file 50.
Let us look at an example.
First, we use User6 to train the classifier. Then we use User9 to test it. Let us see how well the classifier scores.
def main():
    knn = KNNCommandClassifier()
    knn.train(6)
    print(knn.score(9))
The output:
0.9133333333333333
The classifier scores about 91% on User9, so it seems to work well.