Pattern distribution algorithm
Publication Date: 2012-Jun-18
The IP.com Prior Art Database
A two-step method for fast and efficient distribution of patterns over multiple DFAs to obtain a more compact data structure.
Page 01 of 3
Pattern distribution algorithm
Disclosed is a method for fast and efficient distribution of patterns over multiple DFAs, which is done to prevent the occurrence and/or reduce the impact of the so called state explosion problem that is inherent to the mapping of regular-expression patterns on deterministic finite automata (DFA). This problem is
well-know in the literature and arises when certain combinations of patterns that strongly interact together, are mapped on the same DFA resulting in very large numbers of states and state transitions.
In order to reduce the impact of the state explosion problem, state-of-the-art regular-expression matching engines typically distribute the patterns that are involved in a scan operation, over multiple DFA that are used in parallel to scan a given input stream. The objective is to distribute the patterns in such way over the DFA that the aggregated DFA size is (substantially) smaller than that of the single DFA that would have been obtained when all patterns were mapped together. However, due to the complexity of this problem, an optimal distribution of patterns cannot be found in reasonable time for large input sets. In fact,  shows that even
when the problem formulation is relaxed (i.e., when approximating the storage
consumption by only quantifying pair-wise pattern interactions), the problem is NP-hard .
Different heuristics have been proposed to distribute the patterns over a given number of DFA. In , a boolean interaction matrix is constructed. This matrix determines whether two patterns interact together. By definition, two patterns interact together if the resulting DFA has more states than if the two DFA were constructed separately. Based on this matrix, patterns are iteratively distributed to the DFA: each new pattern is distributed to the DFA that minimizes the number of bad interactions. Each time a pattern is distributed, the corresponding DFA is constructed. This allows ensuring that the memory consumption does not exceed a given threshold. The main drawback of this heuristic is the absence of a quantification of the pair-wise interaction of patterns. In addition, the algorithm has been shown to be quite slow , which is principally due to the computation of the
DFA occurring after each pattern distribution.
In , a simulated annealing heuristic is implemented as pattern distributor. The pair-wise interaction between patterns is quantified. In fact, a distance function is defined that directly measures the increase (or decrease) of the number of rules
when two patterns are combined together. It is shown that the use of the distance function significantly improves the overall storage. In addition, the simulated annealing method performs faster than . The main drawback of the solution  is the utilization of a a non-deterministic algorithm, which provides varying results. In addition, the inherent use of multiple heuristic parameters (e.g., number of time steps,...