Efficient solution for providing XML Schema "(nested) character class subtraction"

IP.com Disclosure Number: IPCOM000179548D
Original Publication Date: 2009-Feb-17
Included in the Prior Art Database: 2009-Feb-17
Document File: 2 page(s) / 26K

This invention is an efficient solution for providing "(nested) character class subtraction" in XML Schema.

XML Schema Datatypes (http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/) comprise so-called character classes, used mainly for pattern-like restrictions of simple types. For example, the character class "letter or digit" is given by "[A-Za-z0-9]".

Sometimes values or ranges need to be filtered out of a character class, which is called "character class subtraction": http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#nt-negCharGroup The XML specification uses the notation "A - B" for the same idea: http://www.w3.org/TR/REC-xml/#sec-notation

For example, the term "[a-z-[aeiou]]" matches all consonants. A more complicated and realistic example is this description of British postcodes ( http://en.wikipedia.org/wiki/UK

"(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})"

This matches e.g. SW8 2LP.

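To make the meaning of the subtraction concrete, the following minimal sketch expands each subtracted class into a plain character class by set difference and then checks the postcode pattern with Python's `re` module. The helper `subtract` and the names `A1`, `A2`, `A3`, `POSTCODE` and `is_postcode` are hypothetical and introduced only for illustration, and the third-position class is assumed to read [A-HJKSTUW]. This is a flattening approach for explanation only, not the translation proposed in this disclosure.

```python
import re
import string


def subtract(base: str, removed: str) -> str:
    """Return a plain character class listing every char of base not in removed."""
    kept = sorted(set(base) - set(removed))
    return "[" + "".join(re.escape(c) for c in kept) + "]"


# Equivalent of "[a-z-[aeiou]]": the 21 lowercase consonants.
CONSONANT = subtract(string.ascii_lowercase, "aeiou")

UPPER = string.ascii_uppercase
A1 = subtract(UPPER, "QVX")     # stands in for [A-Z-[QVX]]
A2 = subtract(UPPER, "IJZ")     # stands in for [A-Z-[IJZ]]
A3 = subtract(UPPER, "CIKMOV")  # stands in for [A-Z-[CIKMOV]]

# Same structure as the postcode pattern, with the subtractions expanded.
POSTCODE = re.compile(
    rf"(GIR 0AA)|((({A1}[0-9][0-9]?)|(({A1}{A2}[0-9][0-9]?)"
    rf"|(({A1}[0-9][A-HJKSTUW])|({A1}{A2}[0-9][ABEHMNPRVWXY]))))"
    rf" [0-9]{A3}{{2}})"
)


def is_postcode(text: str) -> bool:
    # XML Schema patterns are implicitly anchored, hence fullmatch.
    return POSTCODE.fullmatch(text) is not None
```

With these definitions, `is_postcode("SW8 2LP")` is true, while `is_postcode("QW8 2LP")` is false because Q is subtracted from the first position.
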
The following article discusses "character class subtraction" and states that only two engines (JGsoft and .NET) support this XML Schema feature: http://www.regular-expressions.info/xmlcharclass.html#subtract Moreover, that page describes the "nested character class subtraction" aspect of the specification, which is not implemented anywhere: [0-9-[0-6-[0-3]]] is the same as [0-37-9].
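The nested equivalence can be checked in an engine without subtraction support by writing each subtraction by hand as a negative lookahead. The sketch below uses Python's `re` module; the names `NESTED_AS_LOOKAHEAD`, `FLAT` and `accepted` are hypothetical, and the lookahead spelling is an illustrative rewrite, not syntax from the XML Schema specification.

```python
import re

# [0-6-[0-3]] written as "(?![0-3])[0-6]" accepts 4, 5 and 6.
# Subtracting that from [0-9] wraps it in a second negative lookahead.
NESTED_AS_LOOKAHEAD = re.compile(r"(?!(?![0-3])[0-6])[0-9]")
FLAT = re.compile(r"[0-37-9]")


def accepted(regex: "re.Pattern[str]") -> set:
    """Return the set of single digits that the regex matches completely."""
    return {d for d in "0123456789" if regex.fullmatch(d)}
```

Both `accepted(NESTED_AS_LOOKAHEAD)` and `accepted(FLAT)` yield the digits 0, 1, 2, 3, 7, 8 and 9.
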


A typical option for implementing XML Schema pattern handling is translation to the PCRE library (http://www.pcre.org/pcre.txt). The basic idea of the invention is to add the following to the normal use of the PCRE library:
prepending 7 fill-in characters (#######) to each subtracted character class within the surrounding character class
local replacement of the fill-in, making use of "negative lookahead (?!...)" and "non-capturing groups (?:...)"...
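The lookahead idea can be sketched as a small translator. This is a minimal illustration under stated assumptions, not the procedure of this disclosure: it skips the 7-character fill-in step and rewrites each subtraction directly, it targets Python's `re` module rather than PCRE, and it handles only simple classes (literals, ranges, backslash escapes, and trailing "-[...]" subtractions). The function names `translate` and `_translate_class` are hypothetical.

```python
import re


def translate(pattern: str) -> str:
    """Rewrite XSD class subtractions as (?:(?!SUBTRACTED)[BASE]) groups."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":                      # copy escapes verbatim
            out.append(pattern[i:i + 2])
            i += 2
        elif ch == "[":                     # a character class starts here
            expr, i = _translate_class(pattern, i)
            out.append(expr)
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def _translate_class(pattern: str, start: int):
    """Translate the class opening at pattern[start]; return (regex, next index)."""
    i = start + 1
    base = []
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            base.append(pattern[i:i + 2])
            i += 2
        elif ch == "-" and pattern[i + 1:i + 2] == "[":
            # "-[" introduces a subtracted class, which may itself be nested.
            sub, i = _translate_class(pattern, i + 1)
            if pattern[i:i + 1] != "]":
                raise ValueError("expected ']' after subtracted class")
            # The lookahead rejects the subtracted class; the base class
            # then consumes exactly one character.
            return f"(?:(?!{sub})[{''.join(base)}])", i + 1
        elif ch == "]":
            return f"[{''.join(base)}]", i + 1
        else:
            base.append(ch)
            i += 1
    raise ValueError("unterminated character class")


if __name__ == "__main__":
    print(translate("[a-z-[aeiou]]"))       # (?:(?![aeiou])[a-z])
    print(translate("[0-9-[0-6-[0-3]]]"))
```

Because the subtracted class is translated recursively, nested subtraction needs no special case. The result should be matched with `re.fullmatch`, since XML Schema patterns are implicitly anchored. XML Schema specific escapes such as \i and \c are not handled by this sketch.
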