Martin, Xavier
The general problem we consider in this thesis is the following: we have to analyze a stream of data (records, packets, events ...) by successively applying to each piece of data a set of ``rules'. Rules are best viewed as lightweight parallel processes synchronizing on each arrival of a new piece of data. In many applications, such as signature-based intrusion detection, only a few rules are concerned with each new piece of data. But all other rules have to be executed anyway just to conclude that they can ignore it. Our goal is to make it possible to avoid this useless work completely.
To do so, we perform a static analysis of the code of each rule and we build a decision tree that we apply to each piece of data before executing the rule. The decision tree tells us whether executing the rule or not will change anything to the global analysis results. The decision trees are built at compile time, but their evaluation at each cycle (i.e., for each piece of data) entails an overhead. Thus we organize the set of all computed decision trees in a way that makes their evaluation as fast as possible.
The two main original contributions of this thesis are the following. Firstly, we propose a method to organize the set of decision trees and the set of active rules in such a way that deciding which rules to execute can be made optimally in O(r_u), where r_u is the number of useful rules. This time complexity is thus independent of the actual (total) number of active rules. This method is based on the use of a global decision tree that integrates all individual decision trees built from the code of the rules.
Secondly, as such a global tree may quickly become much too large if usual data structures are used, we introduce a novel kind of data structure called sequential tree that allows us to keep global decision trees much smaller in many situations where the individual trees share few common conditions. (When many conditions are shared by individual trees the global tree remains small.)
To assess our contribution, we first modify the implementation of ASAX, a generic system for data stream analysis based on the rule paradigm presented above. Then we compare the efficiency of the optimized system with respect to its original implementation, using the MIT Lincoln Laboratory
Evaluation Dataset and a classical set of intrusion detection rules. Impressive speed-ups are obtained.
Finally, our optimized implementation has been used by Nicolas Vanderavero, in his PhD thesis, for the design of stateful honeytanks (i.e., low-interaction honeypots). It makes it possible to simulate tens of thousands hosts on a single computer, with a high level of realism.
Bibliographic reference |
Martin, Xavier. Fast sequential implementation of a lightweight, data stream driven, parallel language with application to intrusion detection. Prom. : Le Charlier, Baudouin |
Permanent URL |
http://hdl.handle.net/2078.1/6253 |