Recently I have been studying intrusion detection with machine learning, hoping to apply what I learned in this semester's Machine Learning course to some real-life problems. The first topic is using Hidden Markov Models to detect abnormal parameter input.
Abnormal Parameters Input
In normal situations, most URLs look like this:
www.xxx.com/index.php?id=1
www.xxx.com/shownews.php?ArticleID=65535
www.xxx.com/user.php?username=hazzel&_token=20fh8nv5f983mc9c2
In most normal applications, URL parameters carry information to the server, so their values follow obvious patterns. When you specify an article by ID, the input is a number; when you fill in a username field, it is a string, possibly containing digits and a small set of symbols. In other words, normal parameters have patterns. If we can find these patterns, we can flag URLs whose parameters do not follow them as abnormal.
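Before looking for patterns, we need the parameter values themselves. Here is a minimal sketch using Python's standard urllib.parse; the helper name extract_parameters is only an illustration, not part of any code from this post. Note that parse_qsl percent-decodes values and turns "+" into a space, which may or may not be what you want for detection.

```python
from urllib.parse import urlsplit, parse_qsl

def extract_parameters(url):
    """Return the (name, value) pairs found in the query string of a URL."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters such as "id=" instead of dropping them.
    return parse_qsl(query, keep_blank_values=True)

params = extract_parameters("www.xxx.com/user.php?username=hazzel&_token=20fh8nv5f983mc9c2")
print(params)
```

Each value can then be analyzed separately, one model per parameter name.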
How can we find these patterns?
Hidden Markov Models
A Hidden Markov Model (HMM) is a kind of model that is good at analyzing sequences of states and their outcomes. Here is a graph that may help you understand HMMs.
As you can see in the graph, there are two states (Rainy and Sunny) and three outcomes (Walk, Shop, Clean). The graph specifies the probability of moving from one state to another and the probability of each outcome in each state.
With this model, we can draw some conclusions. For example, if today is rainy, then tomorrow is more likely to be rainy as well, and you are more likely to clean at home instead of walking outside and kissing the rain.
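To make the weather model concrete, here is a small sketch of the forward algorithm, which computes how likely a sequence of outcomes is under an HMM. The probabilities below are assumptions (the values commonly used for this textbook example), not numbers read from the graph, and the function name is made up for illustration.

```python
states = ("Rainy", "Sunny")

# Assumed values (the usual textbook numbers), not taken from the graph.
start_prob = {"Rainy": 0.6, "Sunny": 0.4}
trans_prob = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_prob = {
    "Rainy": {"Walk": 0.1, "Shop": 0.4, "Clean": 0.5},
    "Sunny": {"Walk": 0.6, "Shop": 0.3, "Clean": 0.1},
}

def sequence_probability(observations):
    """Forward algorithm: probability of observing this sequence of outcomes."""
    # alpha[s] = P(outcomes seen so far, current state is s)
    alpha = {s: start_prob[s] * emit_prob[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_prob[s][obs]
            * sum(alpha[prev] * trans_prob[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

print(sequence_probability(["Walk", "Shop", "Clean"]))
```

With these assumed numbers, Rainy is more likely to be followed by Rainy than by Sunny, and Clean is the most likely outcome on a rainy day, which matches the conclusions above.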
Markov models are powerful probabilistic tools for analyzing sequential data. They are already widely used in weather prediction, input prediction, and stock market analysis. Do they also work for discovering the patterns of parameter inputs?
Since parameter inputs are sequences too, they do. Here is how.
Let us imagine that we have a series of inputs.
www.xxx.com/index.php?tag=413jia
www.xxx.com/index.php?tag=293mcd
www.xxx.com/index.php?tag=123ABC
... (numerous URLs)
As humans, we can recognize that a normal input has the format “num num num char char char”. So when we see URLs like the ones below, we realize they are abnormal.
www.xxx.com/index.php?tag=123_<>BC123
www.xxx.com/index.php?tag='+UNION+SELECT
www.xxx.com/index.php?tag="><script>alert(1)</script>
In a human's mind, normal URLs follow the pattern “num num num char char char”: numbers are followed by numbers or chars, and chars are followed by chars. If chars are followed by symbols, the input is abnormal.
Let us translate this into mathematics, thinking in terms of likelihood. For normal URLs, P(num->num) and P(num->char) are large, while P(char->symbol) is very small. What are the actual values of these likelihoods? We can analyze all the normal inputs and build an HMM from them.
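One way to write this down, assuming we simplify the HMM to a first-order Markov chain over the character classes (num, char, symbol), is to estimate each transition likelihood by counting over the normal inputs and to multiply the transitions along a value. Here a, b and c stand for character classes and x_1 ... x_n for the classes of the characters in one parameter value; these symbols are introduced only for this sketch.

```latex
\hat{P}(a \to b) = \frac{\operatorname{count}(a \to b)}{\sum_{c} \operatorname{count}(a \to c)},
\qquad
P(x_1 x_2 \dots x_n) \approx \prod_{i=1}^{n-1} \hat{P}(x_i \to x_{i+1})
```

For tag=123ABC this gives P(num->num)^2 · P(num->char) · P(char->char)^2, a product of large factors only. A value containing a char->symbol transition picks up at least one very small factor, so its likelihood collapses.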
As a human would, we generalize the inputs first, mapping every character to one of four classes:
- numbers: N
- characters: A
- symbols: S
- others: O
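The mapping above can be sketched in a few lines of Python. The exact class boundaries are an assumption here: digits map to N, ASCII letters to A, ASCII punctuation to S, and everything else (whitespace, non-ASCII characters, control characters) to O. The function name generalize is illustrative.

```python
import string

def generalize(value):
    """Map each character of a parameter value to its class symbol."""
    symbols = []
    for ch in value:
        if ch in string.digits:
            symbols.append("N")
        elif ch in string.ascii_letters:
            symbols.append("A")
        elif ch in string.punctuation:
            symbols.append("S")
        else:
            symbols.append("O")  # whitespace, non-ASCII, control characters
    return "".join(symbols)

print(generalize("413jia"))          # NNNAAA
print(generalize("'+UNION+SELECT"))  # SSAAAAASAAAAAA
```

After this step, every input is a sequence over a four-letter alphabet, which keeps the model small.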
Train with HMM
We can do it manually: just calculate the transition probabilities (likelihoods) from the training set. Of course, we can also use the Python library hmmlearn.
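The manual route can be sketched as plain counting. This is a simplified first-order Markov chain over the generalized symbols, not a full HMM and not the hmmlearn API; the function name train_transitions is made up for this example, and the training sequences are assumed to be generalized already.

```python
from collections import Counter, defaultdict

def train_transitions(sequences):
    """Estimate P(a -> b) from generalized training sequences by counting."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    # Normalize each row of counts so that it sums to 1.
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }

# Generalized forms of tag=413jia, tag=293mcd and tag=123ABC.
probs = train_transitions(["NNNAAA", "NNNAAA", "NNNAAA"])
print(probs)
```

Transitions that never occur in the training set, such as A to S, simply do not appear in the table, so a scoring step has to decide how to treat them.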
Take the input string “123*<>ABC!@#” as an example.
- Generalize the input to “NNNSSSAAASSS”.
- Calculate the likelihood. Since the likelihoods of the transitions from N to S and from A to S are small, the likelihood that this string is a normal input will be very small.
- Give a score by normalization.
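The three steps can be sketched as follows. Everything here is an assumption for illustration: the transition table is hypothetical (as if learned from values like “123ABC”), unseen transitions get an arbitrary small floor, and the normalization divides the log-likelihood by the number of transitions. The actual code behind this post may normalize differently, so the scores are not comparable with the ones quoted later.

```python
import math

# Hypothetical transition probabilities, as if learned from values like "123ABC".
TRANSITIONS = {
    "N": {"N": 2 / 3, "A": 1 / 3},
    "A": {"A": 1.0},
}
UNSEEN = 1e-6  # assumed floor for transitions never seen in training

def score(generalized):
    """Average log-probability per transition; closer to 0 means more normal."""
    pairs = list(zip(generalized, generalized[1:]))
    if not pairs:
        return 0.0
    total = sum(
        math.log(TRANSITIONS.get(a, {}).get(b, UNSEEN)) for a, b in pairs
    )
    return total / len(pairs)

print(score("NNNAAA"))        # generalized "123ABC"
print(score("NNNSSSAAASSS"))  # generalized "123*<>ABC!@#"
```

Working with logarithms avoids multiplying many tiny numbers, and dividing by the length keeps long and short inputs on a similar scale.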
Does this sound hard? The idea is actually simple: calculate the likelihood that a string is a normal input from the transitions between the elements inside it.
If it is still unclear, that is all right. I wrote some code to help you understand this detection method.
In the function test(), you only need to train on the dataset with fit() once. After that you can comment out this line, because the profile is already saved. ranging() calculates the range of the scores of normal URLs. If an input URL gets a score outside this range, it can be regarded as abnormal. For example, the score for the input here is -237934.133924, while the range with the data I provide is (9.39570102156784, 30.602823145885385). That is to say, this URL is abnormal.
def test():
    trainer = Trainer()
    trainer.fit()       # train once; comment out after the profile is saved
    trainer.ranging()   # compute the score range of normal URLs
    score = trainer.predict('www.xxx.com/index.php?id=<script></script>')
    print(score)
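The range check described above can be sketched like this. The helper names are hypothetical, and taking the minimum and maximum of the training scores is only one plausible way to define the range; how ranging() actually computes it is not shown here.

```python
def score_range(training_scores):
    """Return the (lowest, highest) score seen on normal training inputs."""
    return min(training_scores), max(training_scores)

def is_abnormal(score, normal_range):
    """A score outside the range of normal scores is treated as abnormal."""
    low, high = normal_range
    return not (low <= score <= high)

# The range and the score quoted in the text above.
normal_range = (9.39570102156784, 30.602823145885385)
print(is_abnormal(-237934.133924, normal_range))  # True
```

A hard min/max range is sensitive to outliers in the training data, so a percentile-based range may be worth trying if normal URLs get flagged too often.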