Browse Prior Art Database

Capturing Matched Text and Backreferencing for Deterministic Finite Automation (DFA) Regular Expression Engines

IP.com Disclosure Number: IPCOM000014222D
Original Publication Date: 2000-May-01
Included in the Prior Art Database: 2003-Jun-19
Document File: 2 page(s) / 72K

Publishing Venue

IBM

Abstract

Abstract:

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Page 1 of 2

  Capturing Matched Text and Backreferencing for Deterministic Finite Automation (DFA) Regular Expression Engines

Abstract:

Capturing matched text can be accomplished in a DFA engine by maintaining a group-table per state to hold matches associated with each level of parenthesis grouping. That technique can be expanded to support backreferencing by adding objects to the abstract syntax tree that perform matching of states based on group-table entries.

Background:

Capturing Matched Text:

A Regular Expression engine interprets a previously compiled expression (following Regular Expression syntax) to search for patterns in a data-stream. For example, the regular expression:

(T[io]m) (Wh?ite)

will search for a matching sequence of the letter T , followed by a single character ( [ ] ) from the group io , followed by m , followed by a space, followed by the character W , an optional ( ? ) character h , and the sequence ite . The contents of the first name matched ( ( ) ) and second name matched ( ( ) ) will be remembered as well in order of opening parenthesis used.

This expression matches Tom White and Tim Wite and others. In addition, the expression will remember that Tom was the first group matched and White was the second group matched in Tom White.

Backreferencing:

Backreferencing utilizes matched-text capture to enhance search capability. For example, the regular expression:

(\sthe){2,}

will search for two or more sequential occurrences ( {2,} ) of the sequence of one white-space ( \s ) followed by the literal text the . This expression could be used to locate the the in the line "Go to the the store."

Backreferences utilize a previous match as the match criterion for some later part of the expression. This capability permits the expression to be generalized to look for words repeated by using the backreference:

(\s[a-z]+)\1+

This expression says to search for the first group ( ( ) ) starting with the sequence of one white-space ( \s ) followed by one or more ( + ) sequential lowercase alphabetic characters ( [a-z] ) followed by one or more sequences of ( + ) the first group matched ( \1 ) .

Deterministic Search:

A Deterministic Finite Automation (DFA) engine applies the compiled expression against the string to be tested for matches in a way that finds all possible successful match states up to the point in the test string that it has evaluated. Evaluation proceeds from the start to the end of the test string without backtracking. This behavior distinguishes the DFA engine from an Non-Deterministic Finite Automation (NFA) engine which does not necessarily find all match states and relies heavily on sequential processing including backtracking.

1

Page 2 of 2

The construction of the DFA engine is well documented. (Please refer to Design Patterns: Elements of...