Class Regex

Introduction

This class provides Regular Expression matching and substitution for Synergy/DE.

Contents

  1. Introduction
  2. Contents
  3. Explanation of symbols used
  4. Member reference
    1. Continue - don't reset ContinueFrom on a failed match
    2. ContinueFrom - position of end of last match for \G
    3. DotMatchesNewline - dot matches newline option
    4. escape - escape special characters in a string
    5. ExplicitCapture - explicit capture option
    6. Extended - extended syntax option
    7. from - create a Regex
    8. GlobalSearch - global search option
    9. IgnoreCase - ignore case on characters
    10. LastMatch - result of the last match performed
    11. match - match a string
    12. Multiline - multi-line search option
    13. replace - perform substitution
    14. split - split a string at an expression
    15. ToString - a string representation of the Regex
  5. Supporting classes
    1. MatchData - result of calling Regex's match method
    2. RegexException - thrown on a syntax error
  6. Syntax for Regular Expressions
    1. Options
    2. Pattern characters
      1. $ - anchor to end of line
      2. ( ) - group sub-expressions
      3. (?# ) - comment
      4. (?() ) - conditional group
      5. (?: ) - group sub-expressions without back-reference
      6. (?< >), (?' ' ) - named group sub-expression (.NET syntax)
      7. (?<= ), (?<! ) - look-behind
      8. (?= ), (?! ) - look-ahead
      9. (?imnsx) - dynamically set options
      10. (?P< >) - named group sub-expression (Python syntax)
      11. (?P= ) - named group back-reference
      12. * - zero or more
      13. + - one or more
      14. . - any single character
      15. ? - zero or one
      16. ? - non-greedy match
      17. [ ] - character class
      18. \ - escape a special character
      19. ^ - anchor to beginning of line
      20. { } - minimum/maximum repetitions
      21. | - union (or)
    3. POSIX character classes
  7. Features not yet implemented
  8. Features that will never be implemented, and why

Explanation of symbols used

Words in italics indicate an instance of a class. The word corresponds to the class name, except where more than one instance is represented in the same statement. In that case a number (2, 3, etc.) is appended to the class name.

Words in normal typeface are to be taken literally (required punctuation, class name in a static reference, method name, etc.).

The symbol => is used to separate an expression (on the left) from its return value (on the right).

An ellipsis (...) indicates that the previous argument may be repeated any number of times. The description will indicate whether one instance is required.

Member reference

Continue

Regex.Continue => boolean
Regex.Continue = boolean

This option controls how ContinueFrom is set when a match fails. If true, ContinueFrom is not modified by a failure, otherwise it is set to 0 (the beginning of the string). This affects the operation of \G after a failed match. It may be set to true initially via the "c" option on the expression, or you may alter it directly.

ContinueFrom

Regex.ContinueFrom => int
Regex.ContinueFrom = int

This value is used to anchor the \G special character. It is set to 0 initially, which means that \G will anchor to the beginning of the string. After a successful match, ContinueFrom is set to the end of the match. On a failed match, it is reset to 0, unless the Continue option is true.

DotMatchesNewline

Regex.DotMatchesNewline => boolean
Regex.DotMatchesNewline = boolean

DotMatchesNewline controls how a period matches the newline character (char(10)). It may be set initially to true via the "s" option on the expression, or you may alter it here.

When this option is enabled, period matches newline -- otherwise it does not.

static method escape

Regex.escape(a) => string
Regex.escape(a, a2) => string

This method returns string as a copy of a with all special characters preceded by a '\'. If a2 is passed, then its first character is treated as the expression delimiter (which is also escaped in the result). Otherwise, '/' is assumed.

Example:

assert_equal('a\+b\/', Regex.escape('a+b/'))
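
The escaping rule can be sketched in Python for readers who want to see it spelled out. This is not the class's implementation: the function name, the default delimiter argument, and in particular the set of characters treated as special are assumptions made for illustration.

```python
# Hypothetical sketch. The set of "special" characters is an assumption,
# not taken from the Regex class itself.
SPECIAL = set('\\^$.|?*+()[]{}')

def escape(text, delimiter='/'):
    """Return text with each special character, and the delimiter, preceded by a backslash."""
    special = SPECIAL | {delimiter[0]}   # only the first character of the delimiter counts
    return ''.join('\\' + ch if ch in special else ch for ch in text)

print(escape('a+b/'))        # a\+b\/
print(escape('a+b/', '#'))   # a\+b/
```

With '#' as the delimiter, the '/' is left alone because it no longer needs protecting.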

ExplicitCapture

Regex.ExplicitCapture => boolean

This option controls how group capture back-references are numbered. When true, plain parentheses do not allocate a back-reference -- only named groups get numbered. This option is false by default, but may be initially set to true by the 'n' option in the expression. Altering the value of ExplicitCapture after a Regex has been instantiated has no effect, because the option only affects compilation.

Extended

Regex.Extended => boolean

Extended syntax mode causes the parser to ignore white space and end-of-line delimiters within the pattern or options. Additionally, anything between a '#' and an end-of-line delimiter will be ignored as a comment. This option is false by default, but may be set by the 'x' option in the expression. Altering the value of Extended after a Regex has been instantiated has no effect, because the option only affects compilation.

Restrictions: a comment in the options section may only occur after the 'x' option, and may not include the pattern delimiter.

Comments in an expression are not very useful in Synergy/DE, because you can always continue a literal in Synergy/DE with a comment in between in the code. For example:


        expr = R$('/^'        ;beginning of line
        &         'abc|123'   ;Jackson five
        &         '/x')       ;extended mode

The above example doesn't even need extended mode, because the concatenation of the continued lines doesn't contain any white space or comments. I have implemented support for comments in extended mode only for thoroughness and compatibility with expressions that might be imported from another language or a text file. Note that 'end of line' for a comment within an expression means the next line-feed, vertical-tab, form-feed, carriage-return, or the end of the expression -- not the end of the Synergy/DE source line.

static method from, macro R$, constructor

Regex.from(object) => regex
R$(object) => regex
new Regex(object) => regex

Creates a new Regex from the string representation of object, which may be an alpha expression or any class of object. The resulting string will be compiled immediately, which may throw a RegexException if the syntax is incorrect.

If you .include "synthesis.def", then you may use R$ as a syntactic shortcut for Regex.from.
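
The string handed to from has the form <delimiter><pattern><delimiter><options>. A rough Python sketch of how such a string could be taken apart follows. The function name is hypothetical, the handling of a missing closing delimiter is an assumption, and the sketch ignores extended-mode white space and comments in the options.

```python
# Option letters and the member each one sets, as listed in this reference.
OPTIONS = {
    'c': 'Continue', 'g': 'GlobalSearch', 'i': 'IgnoreCase', 'm': 'Multiline',
    'n': 'ExplicitCapture', 's': 'DotMatchesNewline', 'x': 'Extended',
}

def split_expression(expr):
    """Split '<delimiter><pattern><delimiter><options>' into (pattern, option names)."""
    if len(expr) < 2:
        raise ValueError('Empty expression not allowed')
    delimiter = expr[0]
    if delimiter == '\\':
        raise ValueError('delimiter may not be a backslash')
    close = expr.rfind(delimiter)        # the options never contain the delimiter
    if close == 0:
        raise ValueError('missing closing delimiter')   # assumed behavior
    pattern, letters = expr[1:close], expr[close + 1:]
    if any(ch not in OPTIONS for ch in letters):
        raise ValueError('Unknown option')
    return pattern, [OPTIONS[ch] for ch in letters]

print(split_expression('/abc/gi'))   # ('abc', ['GlobalSearch', 'IgnoreCase'])
```

Searching for the last delimiter, rather than the first, is what lets an escaped delimiter such as '\/' appear inside the pattern.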

GlobalSearch

Regex.GlobalSearch => boolean
Regex.GlobalSearch = boolean

GlobalSearch controls whether the replace method will continue to replace text until a match is no longer found. It can be set to true initially with the "g" option in the expression, but you may alter it afterwards here.

IgnoreCase

Regex.IgnoreCase => boolean
Regex.IgnoreCase = boolean

IgnoreCase controls case-sensitivity on matches. The "i" option in the expression can be used to set this value to true initially, but you may change it afterwards by altering this member.

LastMatch

Regex.LastMatch => MatchData

LastMatch contains the MatchData object returned by the last call to match. Note that match is also called by replace, and if the GlobalSearch option was enabled, the last result within replace will be unsuccessful.

method match

Regex.match(a) => MatchData
Regex.match(a, int) => MatchData

Matches the regular expression against a, returning a MatchData object describing the match (or lack thereof). If no match occurred, MatchData's start member is 0. In the second form above, int specifies the beginning position within a for consideration of matches -- but in any case, MatchData's members (start, end, before, etc.) refer to the entire string.

Multiline

Regex.Multiline => boolean
Regex.Multiline = boolean

Multiline controls how match (and consequently, replace as well) treat a newline character (char(10)) embedded in the string being searched. It may be set initially to true via the "m" option on the expression, or you may alter it here.

When Multiline is true, a newline character marks an end of line and a beginning of line, so $ matches its position, and ^ matches the position following. The newline itself can still be matched as a character, but see the DotMatchesNewline option for matching it with a period.

Note that even when Multiline is enabled, you can anchor to the absolute beginning or end of the string without regard to newlines by using \A and \Z, respectively.
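
The same line-anchoring behavior exists in Python's re module, which is used here purely as a stand-in for illustration (it is not this class, and its positions are 0-based). re.MULTILINE plays the role of Multiline and re.DOTALL the role of DotMatchesNewline.

```python
import re

text = 'friend\nfoo'

# Without MULTILINE the embedded newline does not end a line.
single_end = re.search(r'end$', text)
multi_end = re.search(r'end$', text, re.MULTILINE)

# ^ follows the newline only in multi-line mode; \A never does.
multi_start = re.search(r'^foo', text, re.MULTILINE)
absolute_start = re.search(r'\Afoo', text, re.MULTILINE)

# The period matches the newline only when DOTALL is set.
dot_plain = re.search(r'd.f', text)
dot_all = re.search(r'd.f', text, re.DOTALL)

print(single_end, multi_end.start(), multi_start.start(), absolute_start)
print(dot_plain, dot_all.group() == 'd\nf')
```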

method replace

Regex.replace(a, a2) => string
Regex.replace(a, a2, int) => string

Replaces the substring of a that matches the regular expression with a2. If the GlobalSearch option is enabled (the 'g' option on the expression), then all matches will be replaced -- otherwise only the first one will be.

In the second form, int specifies the beginning position within a at which substitutions may occur.

If a2 contains any '\' or '$' characters, the following special substitutions will be performed, left to right:

  • \0, $0 or ${0} - the text matching the entire expression.
  • \1 thru \9, $1 thru $9, or ${1} thru ${2147483647} - the text matching the corresponding parenthesized sub-expression.
  • \& or $& - the text matching the entire expression.
  • \` or $` - the text before the match.
  • \' or $' - the text after the match.
  • \+ or $+ - the text matching the highest-numbered matched sub-expression.
  • \g<group> or ${group} - the text captured in named group group.
  • \x00 thru \xFF - the corresponding ASCII character.
  • \ followed by any other character - that character (including \ or $).
  • $$ - $
  • $ followed by any other character - $ followed by that character.
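
As a point of comparison, Python's re module supports a subset of these substitutions (\1 and \g<group>, but none of the $ forms), so it can illustrate the idea. The sketch below is Python, not this class. The count argument stands in for the GlobalSearch option: count=1 replaces only the first match, and the default replaces all of them.

```python
import re

swap = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
text = 'hello world, good morning'

first_only = swap.sub(r'\2 \1', text, count=1)          # like GlobalSearch off
everywhere = swap.sub(r'\g<second> \g<first>', text)    # like GlobalSearch on
whole = re.sub(r'\d+', r'<\g<0>>', 'a1b22')             # \g<0> is the entire match

print(first_only)   # world hello, good morning
print(everywhere)   # world hello, morning good
print(whole)        # a<1>b<22>
```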

method split

Regex.split(a) => ls
Regex.split(a, boolean) => ls

This method searches for a match of the regex within a, and returns a list of the portions of the string that occur before and after that expression. If the GlobalSearch option is enabled, then a will be split at every occurrence of the expression, otherwise only at the first occurrence.

If the expression contains any capturing groups, the contents of those groups will also be returned in ls, wherever they occurred.

If boolean is passed and false, then no empty strings ('') will be returned. The default is to return empty strings. For example, r$('/x/').split('xy') returns a list containing '' and 'y', whereas r$('/x/').split('xy',false) returns a list containing only 'y'. This also applies to empty strings within captured groups.
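
Python's re.split behaves in a broadly similar way and is used below only as an illustration, not as a description of this class. The maxsplit argument stands in for GlobalSearch being off, and dropping empty strings is done by hand because Python has no equivalent of the boolean argument.

```python
import re

text = 'a, b,c'

every = re.split(r'\s*,\s*', text)                 # split at every match
first = re.split(r'\s*,\s*', text, maxsplit=1)     # split at the first match only
with_groups = re.split(r'\s*(,)\s*', text)         # captured text is returned too

keep_empty = re.split(r'x', 'xy')
drop_empty = [part for part in re.split(r'x', 'xy') if part]

print(every, first, with_groups)
print(keep_empty, drop_empty)
```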

override method ToString, explicit alpha cast

Regex.ToString() => string
(a)Regex => string

Returns the original string used to construct the Regex, regardless of any changes to options after construction.

Supporting classes

Class MatchData

A member of this class is returned by the match method, and is also available in the public member LastMatch. It describes a regular expression match, or lack thereof.

Public members:

  • [int] => MatchData - (Indexer) subexpression matches ([0] = this, [1] = \1, etc.). If no match exists for a selected sub-expression, the returned MatchData object will have start and length set to 0 and matched will return "".
  • [a] => MatchData - (Indexer) named group matches. If no such named group was captured, the returned MatchData object will have start and length set to 0 and matched will return "".
  • after => string - the portion of source following the matching substring, or "" if nothing follows or no match occurred.
  • before => string - the portion of source before the matching substring, or "" if nothing preceded or no match occurred.
  • Count => int - the number of elements accessible through the Indexer. I.e., the highest matched sub-expression plus one.
  • end => int - the index of the last character of the match within source.
  • length => int - the length of the matching substring. Zero if no match occurred, or the empty string matched.
  • matched => string - the matching portion of source, or "".
  • replace(string) => string - replace \1..\9, etc. with sub-matches. This is called automatically by Regex's replace method.
  • source => string - the entire original string searched.
  • start => int - the starting index of the match within source, or 0 if no match occurred.
  • ToString() => string - a string representation of the match information.

Class RegexException extends Synergex.SynergyDE.SynException

This exception will be thrown upon construction of a Regex if the expression contains any syntax errors. Its Message member contains the message "Error parsing regular expression: " followed by the text of the expression (or a portion thereof), followed by a more detailed description of the problem:
  • "Empty expression not allowed" - only an initial delimiter was supplied.
  • "Unknown option" - one of the options was not c, g, i, m, n, s, or x.
  • "Missing operand" - an operator didn't have an operand (e.g. '/*/')
  • "Final \ encountered" - the expression ended with '\'
  • "\c expects a letter" - the character after '\c' was something else.
  • "\x expects two hex digits" - we didn't get them.
  • "Empty [] not allowed" - an empty character class was encountered.
  • "Missing ]" - a character class wasn't terminated.
  • "Character range requires two characters" - [-], [A-], or [-B].
  • "Invalid character range" - [\s-\d] or [Z-A], for instance.
  • "Unmatched )" - at least one too many close parentheses.
  • "Missing ? sub-operator" - encountered "(?" at the end of the pattern.
  • "Unknown ? sub-operator" - encountered "(?" followed by an unrecognized sub-operator.
  • "Missing ?P sub-operator" - encountered "(?P" at the end of the pattern.
  • "Unknown ?P sub-operator" - encountered "(?P" followed by an unrecognized sub-operator.
  • "Malformed named group" - "(?P<" was not followed by at least one character and a ">".
  • "Malformed named group reference" - "(?P=" was not followed by at least one character and a ")".
  • "Malformed {n,m} specification" - "{" was not followed by a "}".
  • "Number expected" - "{ }" contained something other than digits or a comma.
  • "Malformed conditional" - the inner parentheses of a conditional group were empty or unmatched.
  • "Unrecognized ?< sub-operator" - "(?<" was followed by something other than "=" or "!".
  • "Unrecognized POSIX character class" - the name specified does not match a known class

Regular Expression syntax

The string from which a regular expression may be constructed must have the general form:

<delimiter><pattern><delimiter><options>

where:

<delimiter> is any single character except '\'. Both delimiters must match.
<pattern> is the regular expression pattern specifying the match.
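<options> is zero or more of the following single-character search options:

  • c - Continue
  • g - GlobalSearch
  • i - IgnoreCase
  • m - Multiline
  • n - ExplicitCapture
  • s - DotMatchesNewline
  • x - Extended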

Pattern characters

Most characters contained within <pattern> must follow one another immediately in the target string in order to match. For instance, R$('/xyz/') matches 'abcdefghijklmnopqrstuvwxyz' at position 24, but it does not match 'xzy' or 'x y z'.

Some characters, however, have special meaning within a regular expression. If you want to include those characters literally, you must escape them with a preceding '\'. You can also include your delimiter in the same way -- e.g., R$('/\//') matches a '/'. Following is a list of all special characters supported by this implementation, and their meanings.

$ - anchor to end of line

The match must end at the last character of a line of text in order to be accepted. If Multiline is true, then an embedded newline character qualifies as marking the end of a line, but the end of the string always qualifies.

Example: R$('/end$/') matches 'friend', but not 'friends'.

( ) - group sub-expressions

Parentheses can be used to override the usual operator precedence by grouping operations together. Additionally, parenthesized sub-expressions are counted from left to right (by their open parenthesis) to number them from 1 to the number of sub-expressions encountered, unless the ExplicitCapture option is enabled. The text that matched each sub-expression can be accessed later on within the pattern by using \1 through \9, or from the MatchData object returned from the method match by indexing it. Thus for example,

match = R$('/.*(c)/').match("abcd")
assert(match[1].start == 3)
assert(match[1].length == 1)

Additionally, in the replace method, the replacement string may contain escaped references to these sub-expressions.

If a sub-expression is completed more than once within a match, the last one wins.
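
Both points can be checked against Python's re module, used here only as an analogy. Python positions are 0-based, so a start of 2 there corresponds to a start of 3 in this class.

```python
import re

m = re.search(r'.*(c)', 'abcd')
start_of_c = m.start(1)                 # 0-based; 3 when counted from 1
length_of_c = m.end(1) - m.start(1)

# The group is completed three times ('a', 'b', 'c'); only the last is kept.
repeated = re.fullmatch(r'(?:(\w),?)+', 'a,b,c')
last_capture = repeated.group(1)

print(start_of_c, length_of_c, last_capture)   # 2 1 c
```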

(?# ) - comment

If a parenthesized group begins with "?#", then everything up to the next close parenthesis is taken as a comment and ignored. Note that this is the next parenthesis, not the matching one. You cannot include a close parenthesis in this type of comment (but see the x option).

This class of comment, although matching no input, is taken to be an operand. Thus, it cannot separate an operand and its intended operator. For example:

R$('/a(?# a comment)*/').match('aa')

will have a length of 1, because the '*' repeats the comment, not the 'a'. However,

R$('/a(?# a comment)b/').match('ab')

does match, because the comment matches zero-length text.

(?() ) - conditional group

If the condition within the inner parentheses is true, then the expression that follows up until the final parenthesis applies. If that expression contains a union (|), then the left side of the union will be treated as the 'then' section (applied if the condition is true), and the right side will be treated as 'else' (applied if the condition is false). To force a union to remain a union, enclose it within another level of parentheses.

A conditional group does not consume a numbered back-reference.

The condition can have one of three forms:

  1. Numbered group reference - if all of the characters in the condition are numeric, then the corresponding captured group will be tested for membership (as in a \1..\9 back-reference). If that group has been captured, the condition is true, otherwise it is false. Note that a zero-length group is counted as captured.
  2. If the condition begins with "?", then it must be a look-ahead or look-behind test. If that test passes, then the condition is true, otherwise false.
  3. If neither of the above, then the condition is taken as the name of a captured named group. If that group has been captured, the condition is true, otherwise false.

Examples:

R$('/(John )?\bSmith(?(1)son)\b/') - matches 'Smith' alone, but if preceded by 'John ' then it must be 'Smithson'.

R$('/((?(dot)\.)(?P<dot>\d{1,3})){4}/') - matches an IP address, because the \. is not applied until a numeric group has been captured.

R$('/(?:(x)|y) = (?(1)0|-1)/') - matches "x = 0" or "y = -1", but no other combinations.

R$('/\d+(?(?=.*\boverdrawn\b)\*)/i') - matches a number, which must include a trailing '*' if the word "overdrawn" occurs later on in the string.

R$('/\d(?(?<=[0-5])A|B)/') - matches "0A", "1A", "2A", "3A", "4A", "5A", "6B", "7B", "8B", and "9B"
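
Python's re module supports the numbered and named forms of the condition (not the look-ahead and look-behind forms), so the third example above can be tried there. This is a sketch in Python, not this class. fullmatch is used so that each test string must match in its entirety, and the second pattern is a made-up illustration of a named condition.

```python
import re

cond = re.compile(r'(?:(x)|y) = (?(1)0|-1)')
results = {s: cond.fullmatch(s) is not None
           for s in ('x = 0', 'y = -1', 'x = -1', 'y = 0')}
print(results)

# Named condition: a closing '>' is required only if an opening '<' was captured.
tag = re.compile(r'(?P<open><)?\w+(?(open)>)')
tags = {s: tag.fullmatch(s) is not None for s in ('<tag>', 'tag', '<tag', 'tag>')}
print(tags)
```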

(?: ) - group sub-expressions without back-reference

If a parenthesized group begins with "?:", then it operates exactly like a normal sub-expression, except that it can't be subsequently referenced by a sub-expression number. That means that the MatchData object returned by match does not include that sub-expression in its indexer, nor can it be referenced later in the pattern via \1..\9, nor accessed within the a2 argument to replace. The sub-expression is not included in the count of sub-expressions, so the next counted sub-expression will have the index following the previous one.

Example:

match = R$('/(?:a|l)*(.*)tor/').match('alligator')
assert(match[1].start == 4)
assert(match[1].length == 3)

To make this behavior the default for unadorned parentheses, see the ExplicitCapture option.

(?<group> ), (?'group' ) - named group sub-expression (.NET syntax)

For compatibility with .NET named group capture, I've included support for the .NET syntax. .NET named groups differ from the Python style named groups not only in syntax, but also in the assignment of numbered back-references. While Python groups are numbered along with unnumbered groups, .NET groups are numbered after all other groups -- unless they duplicate a Python group name, in which case they get the number of the group they duplicate. As implied by that exception, .NET group names share the same namespace as Python named groups. They can be used interchangeably, and you can use either syntax for back-references and text-replacement.

Examples:

match = r$("/(?'first'a)(b)(?<second>c)/").match('abc')
assert_equal('a', match['first'].matched)
assert_equal('c', match['second'].matched)
assert_equal('b', match[1].matched) ;Unnamed groups first
assert_equal('a', match[2].matched) ;But .NET groups get numbered after
assert_equal('c', match[3].matched)

(?<= ), (?<! ) - look-behind

If a parenthesized group begins with "?<=" or "?<!", then the remainder of the group is matched against the part of the string that has already been consumed. If a match can be found that ends at the current location, then the outer match proceeds as if the group were not there. Otherwise, it doesn't match.

You can use any regular expression syntax within a look-behind, including group capture and anchors.

You can also use a look-behind as the condition in a conditional.

Examples:

R$('/(?<=(\w+)\s+)Smith/') - matches "Smith" when it is preceded by another word, and captures that word as sub-expression 1.

R$('/\b\w+\b(?<!s)/') - matches words that don't end with 's'.
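
For comparison, the second example also works in Python's re module, which is not this class. One difference is worth noting: Python only accepts look-behinds that match a fixed number of characters, so the first example (with \w+ inside the look-behind) would be rejected there. The sample sentence is made up.

```python
import re

# Words that do not end with 's'; the look-behind inspects the character just consumed.
not_plural = re.compile(r'\b\w+\b(?<!s)')
words = not_plural.findall('cats dog birds fish')
print(words)   # ['dog', 'fish']
```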

(?= ), (?! ) - look-ahead

If a parenthesized group begins with "?=" or "?!", then the remainder of the group is matched against what follows in the string, without changing the position for the rest of the expression. If a match can be found that begins at the current location, then the outer match proceeds as if the group were not there. Otherwise, it doesn't match.

You can use any regular expression syntax within a look-ahead, including group capture and anchors.

You can also use a look-ahead as the condition in a conditional.

Examples:

R$('/John(?=\s+(\w+))/') - matches "John" when it is followed by another word, and captures that word as sub-expression 1.

R$('/(?!\d*5)\d+/') - matches a group of digits, but only if no 5 occurs among the digits from the starting position onward.
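
Both look-ahead examples can be tried in Python's re module, shown here only as an analogy to this class. Note that the look-ahead captures 'Smith' without consuming it: the match itself still ends right after 'John'.

```python
import re

m = re.search(r'John(?=\s+(\w+))', 'John Smith')
matched, end, captured = m.group(0), m.end(), m.group(1)
print(matched, end, captured)   # John 4 Smith

no_five = re.compile(r'(?!\d*5)\d+')
accepts = no_five.fullmatch('1234') is not None
rejects = no_five.fullmatch('1253') is None
print(accepts, rejects)         # True True
```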

(?imnsx) - dynamically set options

For each of the letters that this group includes, the corresponding option is turned on for the remainder of the expression. You can explicitly turn off an option by preceding it with a '-'. The options are the same as when specified following the pattern (which sets their initial states), except that Continue (c) and GlobalSearch (g) are not included because those options cannot be applied to only part of an expression.

Examples:

R$('/case(?i)nocase(?-i)case/')
R$('/not extended, single line(?xm) extended mode, multi-line(?-x-m) and back/')

If the options are followed by a colon (:), then they only apply to the expression that follows the colon up to the close parenthesis. The above examples can therefore be rewritten as:

R$('/case(?i:nocase)case/')
R$('/not extended, single line(?xm: extended mode, multi-line) and back/')

When using the first form, be aware that the ExplicitCapture and Extended options affect the operation of the compile phase, while the other three (IgnoreCase, Multiline, and DotMatchesNewline) affect pattern matching. Thus, the former always apply left to right, while the latter can affect earlier parts of the expression when repeating. For example:

R$('/\b([a-z](?-i))+\b/i')

The above matches any word where only the first character may be capitalized or not (thanks to the initial IgnoreCase option selected after the pattern), but the remaining characters must be lowercase. That's because the (?-i) still applies on the second and following repetitions of the loop. While this can be useful, it can also be confusing. To avoid these situations, use the encapsulated form (with the colon) where the scope of the setting is clear.
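
The encapsulated form can be tried in Python's re module, which is not this class but accepts the same (?i: ) and (?-i: ) groups. Recent Python versions reject a bare (?i) in the middle of a pattern, so only the scoped form is sketched here.

```python
import re

scoped = re.compile(r'case(?i:nocase)case')
middle_any_case = scoped.fullmatch('caseNoCaSecase') is not None
outer_is_strict = scoped.fullmatch('CASEnocasecase') is None

# The reverse: ignore case everywhere except inside the group.
mostly_loose = re.compile(r'(?i)a(?-i:b)c')
loose_ok = mostly_loose.fullmatch('AbC') is not None
strict_b = mostly_loose.fullmatch('ABC') is None

print(middle_any_case, outer_is_strict, loose_ok, strict_b)   # True True True True
```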

(?P<group> ) - named group sub-expression (Python syntax)

Creates a sub-expression that does have a numbered back-reference, but may also be referenced by the name group. Group names are always case-sensitive. They may contain any characters except > and ). The text following the >, up to but not including the ) is the sub-expression to capture.

Group names do not have to be unique. In cases where they collide, both sub-expressions map to the same collector, which may be accessed by name or by number. The number assigned to all sub-expressions having the same name will be based on when the first one was encountered. When back-referenced or accessed from the replace or MatchData object, the contents of the group will be from the last time a group with that name was matched.

This form of named group capture is based on Python syntax. Unlike the .NET syntax for named groups, these groups are numbered where they occur, along with unnamed groups.

Examples:

match = R$('/a(?P<els>l+)i/').match('alligator')
assert(match['els'].start == 2)
assert(match['els'].length == 2)
assert(match[1].start == 2)
assert(match[1].length == 2)

r1 = r$('/a(?P<x>\d)|b(?P<x>\w)/')
match = r1.match('a5')
assert_equal('5', match['x'].matched)
match = r1.match('bd')
assert_equal('d', match['x'].matched)

(?P=group) - named group back-reference

Back-references the text previously captured in the named group group. If that group has not yet been captured, an empty string is used.

Example:

match = r$('/(?P<quote>[''"]).*?(?P=quote)/').match('I said, "She''s a witch"')
assert_equal(9, match.start)
assert_equal(15, match.length)
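
Since this syntax comes from Python, the same patterns run unchanged in Python's re module. The sketch below is Python rather than this class, so positions are 0-based: a start of 1 or 8 there corresponds to 2 or 9 here. Python does not permit duplicate group names, so that part of the behavior is not shown.

```python
import re

m = re.search(r'a(?P<els>l+)i', 'alligator')
by_name, by_number, start = m.group('els'), m.group(1), m.start('els')
print(by_name, by_number, start)       # ll ll 1

# The back-reference must repeat whichever quote character opened the string.
quoted = re.search(r'''(?P<quote>['"]).*?(?P=quote)''', 'I said, "She\'s a witch"')
q_start, q_length = quoted.start(), len(quoted.group())
print(q_start, q_length)               # 8 15
```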

* - zero or more

Also known as a Kleene closure, this operator matches as many of the preceding expression as possible (greedy search, but see ? below), but can match none of them, depending on the constraints of the rest of the expression.

Example: R$('/A*B/') matches 'B', 'AB', 'ZAAB', and even 'AZB' (because the final 'B' matches the "zero A's followed by B" case).

+ - one or more

This operator matches at least one, but as many as possible (greedy search, but see ? below), of the preceding expression.

Example: R$('/A+B/') matches 'AB' and 'ZAAB', but not 'B' or 'AZB'.

. - any single character

This metacharacter matches any character (except sometimes newline). It's often used to skip over text whose content you don't care about. But be careful of searches that are greedier than you intended.

By default, the period does not match a newline character (linefeed, char(10)). You can force period to match newline by enabling the DotMatchesNewline option.

Example: R$('/A.B/') matches 'AAB' and 'ACB', but not 'AB'. R$('/.*B/') matches 'ABRACADABRA' at position 1, but the length of the match is 9 (inclusive of the second 'B').

? - zero or one

This operator matches one or none of the preceding expression. It prefers one by default (greedy), but if immediately followed by another ? (see below) it prefers none.

Example: R$('/A?B/') matches 'B', 'AB', 'ZAAB' (at position 3) and even 'AZB' (also at position 3).

? - non-greedy match

If a ? immediately follows a *, +, or ? operator, then the match is non-greedy, meaning that it prefers fewer repetitions rather than greater.

Example: R$('/A*?/') matches 'AAA' at position 1, with a length of zero, whereas without the ? it would match to a length of 3.
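
The difference is easy to observe in Python's re module, used here as a stand-in for this class. The second pair reuses the 'ABRACADABRA' example from the period operator.

```python
import re

greedy = len(re.match(r'A*', 'AAA').group())
lazy = len(re.match(r'A*?', 'AAA').group())
print(greedy, lazy)            # 3 0

longest = re.match(r'.*B', 'ABRACADABRA').group()
shortest = re.match(r'.*?B', 'ABRACADABRA').group()
print(longest, shortest)       # ABRACADAB AB
```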

[ ] - character class

The characters between matching square brackets form a specification of a character class. In its simplest form, it's just a list of the characters that qualify. For example: R$('/[xyz]/') matches "x", "y" or "z". In this form, it's synonymous with R$('/x|y|z/').

If the first character within the brackets is "^", then the sense of the character class is reversed. That is, anything except these characters. For example, R$('/[^abc]/') will not match any of the first three lower-case letters of the alphabet.

The back-slash can be used to introduce special characters into the sequence, as detailed below. Thus, R$('/[\d.]/') matches any numeric digit or a period. Note that the period (along with most special characters) has a literal meaning when used within a character class.

A dash ('-') can be used to specify a range of characters. For example, R$('/[A-Z]/') matches all uppercase letters.

If a dash is immediately followed by a [, then it is taken to mean character class subtraction, and the [ is expected to have a matching ] to indicate the end of the class to subtract from the main class. For instance, to match all consonants, you could use R$('/[a-z-[aieou]]/i'). Note, therefore, that to end a range with [ you would need to escape it with a \. You may subtract more than one subclass, and you may nest subtractions. For example, R$('/[a-z-[k-o-[lm]]]/') is functionally equivalent to R$('/[a-z-[kno]]/'), which could also be expressed as R$('/[a-jlmp-z]/'). Subtraction is more useful for readability than performance: R$('/[\d-[5]]/') says "all digits except 5" better than R$('/[0-46-9]/') does.
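
The nested subtraction can be checked with ordinary set arithmetic. Python's re module has no subtraction syntax, so the sketch below computes the sets by hand and compares them with the flattened class. The look-ahead form at the end is one common way to emulate subtraction in engines that lack it.

```python
import re
import string

letters = set(string.ascii_lowercase)     # [a-z]
inner = set('klmno') - set('lm')          # [k-o-[lm]] leaves k, n, o
result = letters - inner                  # [a-z-[k-o-[lm]]]

flat = re.compile(r'[a-jlmp-z]')
flat_set = {ch for ch in string.ascii_lowercase if flat.fullmatch(ch)}

emulated = re.compile(r'(?![kno])[a-z]')
emulated_set = {ch for ch in string.ascii_lowercase if emulated.fullmatch(ch)}

print(sorted(inner), result == flat_set == emulated_set)   # ['k', 'n', 'o'] True
```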

POSIX character classes can be enclosed within another level of brackets and colons. For instance, R$('/[[:punct:]]/') matches a punctuation character. See the list of supported character classes.

Once introduced, a character class becomes an expression like any other character. Thus, it can be repeated with '*', '+', or '?'. For example, R$('/[A-Za-z]*/') matches any number of letters.

\ - escape a special character

The back-slash can be used to insert the character that follows, without special interpretation. However, there are some characters that have special meaning when following a back-slash, depending on context.

Legend
Only available within a character class
Only available when not within a character class
Available in both contexts

  • \1..\9 - back-reference to a previously captured group
  • \0..\377 - the character corresponding to the specified octal value
  • \a - bell (\x07)
  • \A - anchor to beginning of text (like ^, but does not match the beginning of multiple lines when Multiline is enabled)
  • \b - backspace (\x08), within a character class
  • \b - a word boundary, outside a character class
  • \B - a word non-boundary
  • \cA..\cZ, \ca..\cz - the corresponding control character (\x01..\x1A)
  • \d - a numeric digit ('0'..'9')
  • \D - anything except a numeric digit.
  • \e - escape (\x1B)
  • \f - form feed (\x0C)
  • \G - end of last match*
  • \k<group>, \k'group' - named group back-reference
  • \m - beginning of word
  • \M - end of word
  • \n - new line (\x0A)
  • \p{Classname} - POSIX character class Classname
  • \Q - escapes all subsequent text as literal until \E
  • \r - return (\x0D)
  • \s - white space [ \t\n\v\f\r]
  • \S - anything except white space
  • \t - tab (\x09)
  • \v - vertical tab (\x0B)
  • \w - word character [A-Za-z0-9_]
  • \W - anything except a word character
  • \y - a word boundary (like \b)
  • \Y - a word non-boundary (like \B)
  • \z - anchor to end of text (like $, but does not match the end of multiple lines when Multiline is enabled)
  • \Z - like \z, but if the text ends with a linefeed, matches the position of the final linefeed
  • \x00..\xFF - the character corresponding to the specified hexadecimal value.
  • \` - anchor to beginning of text (like \A)
  • \' - anchor to end of text (like \z)
  • \< - beginning of word (like \m)
  • \> - end of word (like \M)

* The operation of \G differs between various regex engines, so it bears explaining here. \G represents an anchor to the end of the last match, meaning that what follows \G must match what follows that position. The actual position used is maintained in the public member ContinueFrom, so that implies that it is Regex-specific, but not string-specific. If you reuse a Regex that contains \G with another target string, you may want to clear the ContinueFrom member first. Conversely, if you want to use another Regex with the same string and have it continue from the match of the other Regex, you must first copy the ContinueFrom value from the first Regex to the second one. When a match fails, ContinueFrom is set to 0, unless the Continue option is true.
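
The bookkeeping described above can be imitated in Python, whose re module has no \G. In this sketch the compiled pattern's match(text, pos) call anchors at pos, and a local variable named continue_from plays the role of ContinueFrom (0-based here). The token pattern and function name are made up for illustration.

```python
import re

token = re.compile(r'\s*(\d+|[a-z]+)')

def tokens(text):
    """Collect consecutive tokens, each anchored where the previous match ended."""
    continue_from = 0                           # start of the string
    found = []
    while True:
        m = token.match(text, continue_from)    # anchored at continue_from, like \G
        if m is None:
            break                               # Continue-style: keep the last position
        found.append(m.group(1))
        continue_from = m.end()
    return found, continue_from

print(tokens('12 abc 7!'))   # (['12', 'abc', '7'], 8)
```

Leaving continue_from untouched on failure corresponds to the Continue option being true. Resetting it to the start of the string would correspond to the default.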

^ - anchor to beginning of line

The match must begin with the first character of a line of text in order to be accepted. If Multiline is true, then an embedded newline character qualifies as marking the beginning of a line, but the beginning of the string always qualifies.

Example: R$('/^front/') matches 'front-end', but not 'affront'.

{ } - specify minimum/maximum repetitions

This operator optionally specifies the minimum and maximum number of repetitions for the term that it follows. It can take any of the following forms:

  • {n} - repeat exactly n times.
  • {n,m} - repeat at least n times, but not more than m times.
  • {n,} - repeat at least n times, but with no maximum.
  • {,m} - repeat no more than m times, but with no minimum.

Example: R$('/(\d{1,3}\.){3}\d{1,3}/') matches an IPv4 address (though it doesn't verify that each number is 255 or less). This expression reads "1 to 3 digits followed by a dot, repeated 3 times, followed by 1 to 3 digits."
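
The same repetition forms are accepted by Python's re module, which is used below only as an illustration and is not this class. fullmatch is used so that the whole test string has to fit the pattern.

```python
import re

ipv4 = re.compile(r'(\d{1,3}\.){3}\d{1,3}')
well_formed = ipv4.fullmatch('192.168.0.1') is not None
out_of_range = ipv4.fullmatch('999.1.1.1') is not None   # accepted: no range check
too_short = ipv4.fullmatch('1.2.3') is not None

exactly_two = re.fullmatch(r'a{2}', 'aa') is not None
at_least_two = re.fullmatch(r'a{2,}', 'aaaa') is not None
at_most_two = re.fullmatch(r'a{,2}', 'aaa') is not None

print(well_formed, out_of_range, too_short)     # True True False
print(exactly_two, at_least_two, at_most_two)   # True True False
```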

| - union (or)

This operator occurs between two expressions to specify that either one or the other is required. It has the lowest operator priority of any operator, so to prevent everything on one side or the other from being lumped together, use parentheses.

Example: R$('/c(a|u)t/') matches 'cat' and 'cut' at position 1. But R$('/ca|ut/') matches 'cut' at position 2 ("ut").

Note: when used in a conditional group, this operator can be taken as an "else" instead.

POSIX character classes

This parser supports the following POSIX character class names, which are not case-sensitive:

Class Name   Description                                     Characters included
alnum        Alphanumeric characters                         a-zA-Z0-9
alpha        Alphabetic characters                           a-zA-Z
ascii        ASCII (7-bit)                                   \x00-\x7F
blank        Space and tab                                   \x20\t
cntrl        Control characters and DEL                      \x00-\x1F\x7F
digit        Digits                                          0-9
graph        Visible characters (characters with graphemes)  \x21-\x7E
lower        Lowercase letters                               a-z
print        Printable characters                            \x20-\x7E
punct        Punctuation                                     !"#$&'()*+,\-./:;<=>?@[\]^_`{|}~
space        Whitespace                                      \x20\t\r\n\v\f
upper        Uppercase letters                               A-Z
word         Word characters                                 A-Za-z0-9_
xdigit       Hexadecimal digits                              0-9A-Fa-f

You may include a POSIX character class within a character class by enclosing it within "[:" and ":]", or by using the Java syntax \p{Classname}. The latter syntax may also be used outside a character class. Neither of these flavors is treated as case-sensitive by this implementation, even though they both are in other parsers (with the Java flavor capitalizing some letters).
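
Engines without POSIX class names can get the same effect by textual expansion. Python's re module is one such engine, so the sketch below rewrites [:name:] into the characters listed in the table before compiling. It is an illustration only, not how this class works: the function name is hypothetical and only a few rows of the table are included.

```python
import re

# A few rows of the table above, written as character-class bodies.
POSIX = {
    'alnum': 'a-zA-Z0-9',
    'alpha': 'a-zA-Z',
    'digit': '0-9',
    'lower': 'a-z',
    'upper': 'A-Z',
    'xdigit': '0-9A-Fa-f',
}

def expand_posix(pattern):
    """Replace each [:name:] with its characters; names are not case-sensitive."""
    def lookup(m):
        name = m.group(1).lower()
        if name not in POSIX:
            raise ValueError('Unrecognized POSIX character class')
        return POSIX[name]
    return re.sub(r'\[:(\w+):\]', lookup, pattern)

expanded = expand_posix('[[:Digit:][:upper:]_]+')
print(expanded)                                         # [0-9A-Z_]+
print(re.fullmatch(expanded, 'AB_12') is not None)      # True
```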

Features not yet implemented

I don't have any items left on my to-do list. If you can think of any features you'd like to see added, please contact me.

Features that will never be implemented, and why

I hate to say never, but the following features do not match my pattern (pun intended) for Regex. If you can make a good argument for including any of these, please let me know.