This class provides Regular Expression matching and substitution for Synergy/DE.
Words in italics indicate an instance of a class. The word corresponds to the class name, except where more than one instance is represented in the same statement. In that case a number (2, 3, etc.) is appended to the class name.
Words in normal typeface are to be taken literally (required punctuation, class name in a static reference, method name, etc.)
The symbol => is used to separate an expression (on the left) from its return value (on the right).
An ellipsis (...) indicates that the previous argument may be repeated any number of times. The description will indicate whether one instance is required.
Regex.Continue => boolean
Regex.Continue = boolean
This option controls how ContinueFrom is set when a match fails. If true, ContinueFrom is not modified by a failure; otherwise it is set to 0 (the beginning of the string). This affects the operation of \G after a failed match. It may be set to true initially via the "c" option on the expression, or you may alter it directly.
Regex.ContinueFrom => int
Regex.ContinueFrom = int
This value is used to anchor the \G special character. It is set to 0 initially, which means that \G will anchor to the beginning of the string. After a successful match, ContinueFrom is set to the end of the match. On a failed match, it is reset to 0, unless the Continue option is true.
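The bookkeeping described above can be sketched in Python. This is an illustration only, not this library: Python's `re` has no \G, so the hypothetical `ContinuingRegex` wrapper emulates a leading \G with `Pattern.match(text, pos)`, and the names `continue_from` and `keep_on_failure` are stand-ins for ContinueFrom and Continue. Offsets are Python's 0-based offsets.

```python
import re


class ContinuingRegex:
    """Hypothetical wrapper that mimics the ContinueFrom rules described above."""

    def __init__(self, pattern, keep_on_failure=False):
        self.pattern = re.compile(pattern)
        self.keep_on_failure = keep_on_failure  # stand-in for the Continue option
        self.continue_from = 0                  # stand-in for ContinueFrom

    def match_at_anchor(self, text):
        # Pattern.match(text, pos) succeeds only if the match starts exactly at pos,
        # which is the effect a leading \G would have.
        m = self.pattern.match(text, self.continue_from)
        if m:
            self.continue_from = m.end()        # success: anchor moves to end of match
        elif not self.keep_on_failure:
            self.continue_from = 0              # failure: reset unless told to keep
        return m


def drain(rx, text):
    """Collect consecutive anchored matches until one fails."""
    found = []
    while (m := rx.match_at_anchor(text)) is not None:
        found.append(m.group())
    return found


resetting = ContinuingRegex(r"\d+,?")
keeping = ContinuingRegex(r"\d+,?", keep_on_failure=True)
tokens = drain(resetting, "10,20,x30")
drain(keeping, "10,20,x30")
```

Both wrappers stop at the `x`, but only `keeping` still remembers where the last successful match ended.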
Regex.DotMatchesNewline => boolean
Regex.DotMatchesNewline = boolean
DotMatchesNewline controls how a period matches the newline character (char(10)). It may be set initially to true via the "s" option on the expression, or you may alter it here.
When this option is enabled, period matches newline -- otherwise it does not.
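For comparison, Python's standard `re` module (not this library) has the same switch under the name `re.DOTALL`, which corresponds to the "s" option. A minimal sketch of the difference:

```python
import re

text = "a\nb"

# Default: the period refuses to match the newline between 'a' and 'b'.
default_match = re.search(r"a.b", text)

# With DOTALL (the analog of DotMatchesNewline), the period matches it.
dotall_match = re.search(r"a.b", text, re.DOTALL)
```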
Regex.escape(a) => string
Regex.escape(a, a2) => string
This method returns string as a copy of a with all special characters preceeded by a '\'. If a2 is passed, then its first character is treated as the expression delimiter (which is also escaped in the result). Otherwise, '/' is assumed.
Example:
assert_equal('a\+b\/', Regex.escape('a+b/'))
Regex.ExplicitCapture => boolean
This option controls how group capture back-references are numbered. When true, plain parentheses do not allocate a back-reference -- only named groups get numbered. This option is false by default, but may be initially set to true by the 'n' option in the expression. Altering the value of ExplicitCapture after a Regex has been instantiated has no effect, because the option only affects compilation.
Regex.Extended => boolean
Extended syntax mode causes the parser to ignore white space and end-of-line delimiters within the pattern or options. Additionally, anything between a '#' and an end-of-line delimiter will be ignored as a comment. This option is false by default, but may be set by the 'x' option in the expression. Altering the value of Extended after a Regex has been instantiated has no effect, because the option only affects compilation.
Restrictions: a comment in the options section may only occur after the 'x' option, and may not include the pattern delimiter.
Comments in an expression are not very useful in Synergy/DE, because you can always continue a literal in Synergy/DE with a comment in between in the code. For example:
expr = R$('/^' ;beginning of line
& 'abc|123' ;Jackson five
& '/x') ;extended mode
The above example doesn't even need extended mode, because the concatenation of the continued lines doesn't contain any white space or comments. I have implemented support for comments in extended mode only for thoroughness and compatibility with expressions that might be imported from another language or a text file. Note that 'end of line' for a comment within an expression means the next line-feed, vertical-tab, form-feed, carriage-return, or the end of the expression -- not the end of the Synergy/DE source line.
Regex.from(object) => regex
R$(object) => regex
new Regex(object) => regex
Creates a new Regex from the string representation of object, which may be an alpha expression or any class of object. The resulting string will be compiled immediately, which may throw a RegexException if the syntax is incorrect.
If you .include "synthesis.def", then you may use R$ as a syntactic shortcut for Regex.from.
Regex.GlobalSearch => boolean
Regex.GlobalSearch = boolean
Regex.IgnoreCase => boolean
Regex.IgnoreCase = boolean
Regex.LastMatch => MatchData
Regex.match(a) => MatchData
Regex.match(a, int) => MatchData
Regex.Multiline => boolean
Regex.Multiline = boolean
Multiline controls how match (and consequently, replace as well) treat a newline character (char(10)) embedded in the string being searched. It may be set initially to true via the "m" option on the expression, or you may alter it here.
When Multiline is true, a newline character marks an end of line and a beginning of line, so $ matches its position, and ^ matches the position following. The newline itself can still be matched as a character, but see the DotMatchesNewline option for matching it with a period.
Note that even when Multiline is enabled, you can anchor to the absolute beginning or end of the string without regard to newlines by using \A and \Z, respectively.
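The effect of this option can be tried out in Python's standard `re` module, used here only as an analog of this library. `re.MULTILINE` plays the role of the "m" option. One difference to keep in mind: Python's \Z always means the absolute end of the string.

```python
import re

text = "one\ntwo"

# Without multiline mode, ^ and $ only anchor to the ends of the whole string.
single_line = re.findall(r"^\w+$", text)

# With multiline mode, each embedded newline ends one line and starts another.
multi_line = re.findall(r"^\w+$", text, re.MULTILINE)

# \A and \Z ignore the embedded newline even in multiline mode.
first_word = re.findall(r"\A\w+", text, re.MULTILINE)
last_word = re.findall(r"\w+\Z", text, re.MULTILINE)
```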
Regex.replace(a, a2) => string
Regex.replace(a, a2, int) => string
Replaces the substring of a that matches the regular expression with a2. If the GlobalSearch option is enabled (the 'g' option on the expression), then all matches will be replaced -- otherwise only the first one will be.
In the second form, int specifies the beginning position within a at which substitutions may occur.
If a2 contains any '\' or '$' characters, the following special substitutions will be performed, left to right:
| Escape | Translation |
|---|---|
| \0, $0 or ${0} | the text matching the entire expression. |
| \1 thru \9, or $1 thru $9, or ${1} thru ${2147483647} | the text matching the corresponding parenthesized sub-expression. |
| \& or $& | the text matching the entire expression. |
| \` or $` | the text before the match. |
| \' or $' | the text after the match. |
| \+ or $+ | the text matching the highest-numbered matched sub-expression. |
| \g<group> or ${group} | the text captured in named group group. |
| \x00 thru \xFF | the corresponding ASCII character |
| \ followed by any other character | that character (including \ or $). |
| $$ | $ |
| $ followed by any other character | $ followed by that character |
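As a rough analog of the substitutions in this table, here is a sketch using Python's standard `re` module rather than this library. Python accepts only the backslash forms (\1, \g<name>, \g<0>), not the $ forms, and it replaces every match unless `count=1` is passed, which is the opposite default from GlobalSearch.

```python
import re

pattern = re.compile(r"(?P<key>\w+)=(\d+)")
text = "a=1 b=2"

# Swap the two captures; count=1 mimics a replace without the 'g' option.
first_only = pattern.sub(r"\2=\g<key>", text, count=1)

# No count limit mimics a replace with the 'g' option.
all_matches = pattern.sub(r"\2=\g<key>", text)

# \g<0> is Python's spelling for the text matching the entire expression.
bracketed = pattern.sub(r"[\g<0>]", text)
```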
Regex.split(a) => ls
Regex.split(a, boolean) => ls
This method searches for a match of the regex within a, and returns a list of the portions of the string that occur before and after that expression. If the GlobalSearch option is enabled, then a will be split at every occurrence of the expression, otherwise only at the first occurrence.
If the expression contains any capturing groups, the contents of those groups will also be returned in ls, wherever they occurred. If boolean is passed and false, then no empty strings ('') will be returned. The default is to return empty strings. For example, r$('/x/').split('xy') returns a list containing '' and 'y', whereas r$('/x/').split('xy',false) returns a list containing only 'y'. This also applies to empty strings within captured groups.
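The same splitting rules can be approximated with Python's standard `re.split`, shown here as an analog and not as this library's API. Python splits at every match by default, so `maxsplit=1` stands in for a split without GlobalSearch, and dropping empty strings is done by hand.

```python
import re

# Empty strings are returned by default, as in the 'xy' example above.
with_empties = re.split(r"x", "xy")

# Filtering mimics passing false as the second argument.
without_empties = [part for part in with_empties if part != ""]

# Captured groups are returned in place, between the surrounding pieces.
every_match = re.split(r"(-)", "a-b-c")

# maxsplit=1 mimics splitting only at the first occurrence.
first_match = re.split(r"(-)", "a-b-c", maxsplit=1)
```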
Regex.ToString() => string
(a)Regex => string
A member of this class is returned by the match method, and is also available in the public member LastMatch. It describes a regular expression match, or lack thereof.
Public members:
The string from which a regular expression may be constructed must have the general form:
<delimiter><pattern><delimiter><options>
where:
<delimiter> is any single character except '\'. Both delimiters must match.
Most characters contained within <pattern> must follow one another immediately in the target string in order to match.
For instance, R$('/xyz/') matches 'abcdefghijklmnopqrstuvwxyz' at position 24, but it does
not match 'xzy' or 'x y z'.
Some characters, however, have special meaning within a regular expression. If you want
to include those characters literally, you must escape them with a preceding '\'. You can
also include your delimiter in the same way -- e.g., R$('/\//') matches a '/'. Following
is a list of all special characters supported by this implementation, and their meanings.
The match must end at the last character of a line of text in order to be accepted. If Multiline is true, then an embedded newline character qualifies as marking the end of a line, but the end of the string always qualifies.
Example: R$('/end$/') matches 'friend', but not 'friends'.
Parentheses can be used to override the usual operator precedence by grouping operations together. Additionally, parenthesized sub-expressions are counted from left to right (by their open parenthesis) to number them from 1 to the number of sub-expressions encountered, unless the ExplicitCapture option is enabled. The text that matched each sub-expression can be accessed later on within the pattern by using \1 through \9, or from the MatchData object returned from the method match by indexing it. Thus for example,
match = R$('/.*(c)/').match("abcd")
assert(match[1].start == 3)
assert(match[1].length == 1)
Additionally, in the replace method, the replacement string may contain escaped references to these sub-expressions.
If a sub-expression is completed more than once within a match, the last one wins.
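Group numbering and the "last one wins" rule can be checked with Python's standard `re` module, which is used here as an analog of this library. Python offsets are 0-based, so 1 is added to compare with the positions quoted above.

```python
import re

m = re.search(r".*(c)", "abcd")
one_based_start = m.start(1) + 1        # position of group 1, counted from 1
group_length = m.end(1) - m.start(1)

# Group 1 is completed three times here; only the final capture is kept.
repeated = re.fullmatch(r"(?:(\w),?)+", "a,b,c")
last_capture = repeated.group(1)
```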
If a parenthesized group begins with "?#", then everything up to the next close parenthesis is taken as a comment and ignored. Note that this is the next parenthesis, not the matching one. You cannot include a close parenthesis in this type of comment (but see the x option).
This class of comment, although matching no input, is taken to be an operand. Thus, it cannot separate an operand and its intended operator. For example:
R$('/a(?# a comment)*/').match('aa')
will have a length of 1, because the '*' repeats the comment, not the 'a'. However,
R$('/a(?# a comment)b/').match('ab')
does match, because the comment matches zero-length text.
If the condition within the inner parentheses is true, then the expression that follows up until the final parenthesis applies. If that expression contains a union (|), then the left side of the union will be treated as the 'then' section (applied if the condition is true), and the right side will be treated as 'else' (applied if the condition is false). To force a union to remain a union, enclose it within another level of parentheses.
A conditional group does not consume a numbered back-reference.
The condition can have one of three forms:
Examples:
R$('/(John )?\bSmith(?(1)son)\b/') - matches 'Smith' alone, but if preceded by 'John '
then it must be 'Smithson'.
R$('/((?(dot)\.)(?P<dot>\d{1,3})){4}/') - matches an IP address, because the \.
is not applied until a numeric group has been captured.
R$('/(?:(x)|y) = (?(1)0|-1)/') - matches "x = 0" or "y = -1", but no other combinations.
R$('/\d+(?(?=.*\boverdrawn\b)\*)/i') - matches a number, which must include a trailing '*' if the word "overdrawn"
occurs later on in the string.
R$('/\d(?(?<=[0-5])A|B)/') - matches "0A", "1A", "2A", "3A", "4A", "5A", "6B", "7B", "8B", and "9B"
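The "x = 0" / "y = -1" example can be reproduced with Python's standard `re` module, which is an analog and not this library. Python supports only a group number or name as the condition, so the look-ahead and look-behind conditions shown above have no direct Python equivalent.

```python
import re

# Group 1 is set only when 'x' was matched, so it selects the 'then' branch (0);
# otherwise the 'else' branch (-1) applies.
pattern = re.compile(r"(?:(x)|y) = (?(1)0|-1)")

results = {
    text: pattern.fullmatch(text) is not None
    for text in ["x = 0", "y = -1", "x = -1", "y = 0"]
}
```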
If a parenthesized group begins with "?:", then it operates exactly like a normal sub-expression, except that it can't be subsequently referenced by a sub-expression number. That means that the MatchData object returned by match does not include that sub-expression in its indexer, nor can it be referenced later in the pattern via \1..\9, nor accessed within the a2 argument to replace. The sub-expression is not included in the count of sub-expressions, so the next counted sub-expression will have the index following the previous one.
Example:
match = R$('/(?:a|l)*(.*)tor/').match('alligator')
assert(match[1].start == 4)
assert(match[1].length == 3)
To make this behavior the default for unadorned parentheses, see the ExplicitCapture option.
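The 'alligator' example behaves the same way in Python's standard `re` module, used here purely as an analog. The start offset is 0-based in Python, so 1 is added for comparison.

```python
import re

m = re.match(r"(?:a|l)*(.*)tor", "alligator")

# The (?:...) group consumed 'all' but does not count, so (.*) is group 1.
captured = m.group(1)
one_based_start = m.start(1) + 1
group_count = len(m.groups())
```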
For compatibility with .NET named group capture, I've included support for the .NET syntax. .NET named groups differ from the Python style named groups not only in syntax, but also in the assignment of numbered back-references. While Python groups are numbered along with unnumbered groups, .NET groups are numbered after all other groups -- unless they duplicate a Python group name, in which case they get the number of the group they duplicate. As implied by that exception, .NET group names share the same namespace as Python named groups. They can be used interchangeably, and you can use either syntax for back-references and text-replacement.
Examples:
match = r$("/(?'first'a)(b)(?<second>c)/").match('abc')
assert_equal('a', match['first'].matched)
assert_equal('c', match['second'].matched)
assert_equal('b', match[1].matched) ;Unnamed groups first
assert_equal('a', match[2].matched) ;But .NET groups get numbered after
assert_equal('c', match[3].matched)
If a parenthesized group begins with "?<=" or "?<!", then the remainder of the group is matched against the part of the string that has already been consumed. If a match can be found that ends at the current location, then the outer match proceeds as if the group were not there. Otherwise, it doesn't match.
You can use any regular expression syntax within a look-behind, including group capture and anchors.
You can also use a look-behind as the condition in a conditional.
Examples:
R$('/(?<=(\w+)\s+)Smith/') - matches "Smith" when it is preceded by another word, and captures
that word as sub-expression 1.
R$('/\b\w+\b(?<!s)/') - matches words that don't end with 's'.
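The second example can be run in Python's standard `re` module, shown as an analog of this library. The first example cannot, because Python requires a look-behind to have a fixed width and rejects `\w+` inside one.

```python
import re

# A word is accepted only if the character just before its end is not 's'.
words_not_ending_in_s = re.findall(r"\b\w+\b(?<!s)", "cats dog birds fish")
```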
If a parenthesized group begins with "?=" or "?!", then the remainder of the group is matched against what follows in the string, without changing the position for the rest of the expression. If a match can be found that begins at the current location, then the outer match proceeds as if the group were not there. Otherwise, it doesn't match.
You can use any regular expression syntax within a look-ahead, including group capture and anchors.
You can also use a look-ahead as the condition in a conditional.
Examples:
R$('/John(?=\s+(\w+))/') - matches "John" when it is followed by another word, and captures
that word as sub-expression 1.
R$('/(?!\d*5)\d+/') - matches a run of digits, but only starting from a position where no 5 remains in that run (in '1253', for instance, it matches only '3').
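Both look-ahead examples can be tried in Python's standard `re` module, used here as an analog of this library. The sketch shows that the look-ahead captures text without consuming it, and that a negative look-ahead is re-tested at each candidate start position.

```python
import re

m = re.search(r"John(?=\s+(\w+))", "John Smith")
matched_text = m.group()      # the look-ahead text is not part of the match
following_word = m.group(1)   # but its capture group is still filled in
match_end = m.end()           # 0-based offset just past 'John'

no_five = re.search(r"(?!\d*5)\d+", "1234").group()
after_five = re.search(r"(?!\d*5)\d+", "1253").group()
```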
For each of the letters that this group includes, the corresponding option is turned on for the remainder of the expression. You can explicitly turn off an option by preceding it with a '-'. The options are the same as when specified following the pattern (which sets their initial states), except that Continue (c) and GlobalSearch (g) are not included because those options cannot be applied to only part of an expression.
Examples:
R$('/case(?i)nocase(?-i)case/')
R$('/not extended, single line(?xm) extended mode, multi-line(?-x-m) and back/')
If the options are followed by a colon (:), then they only apply to the expression that follows the colon up to the close parenthesis. The above examples can therefore be rewritten as:
R$('/case(?i:nocase)case/')
R$('/not extended, single line(?xm: extended mode, multi-line) and back/')
When using the first form, be aware that the ExplicitCapture and Extended options affect the operation of the compile phase, while the other three (IgnoreCase, Multiline, and DotMatchesNewline) affect pattern matching. Thus, the former always apply left to right, while the latter can affect earlier parts of the expression when repeating. For example:
R$('/\b([a-z](?-i))+\b/i')
The above matches any word where only the first character may be capitalized or not (thanks to the initial IgnoreCase option selected after the pattern), but the remaining characters must be lowercase. That's because the (?-i) still applies on the second and following repetitions of the loop. While this can be useful, it can also be confusing. To avoid these situations, use the encapsulated form (with the colon) where the scope of the setting is clear.
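The encapsulated form is also available in Python's standard `re` module (3.6 and later), which makes for a convenient analog; this is not this library's code. The second pattern is one way to say "first letter in either case, the rest lowercase" with an explicit scope.

```python
import re

scoped = re.compile(r"case(?i:nocase)case")
mixed_middle = scoped.fullmatch("caseNoCaSecase") is not None
mixed_prefix = scoped.fullmatch("CASEnocasecase") is not None

# IgnoreCase applies to the first letter only; the scope ends at the parenthesis.
capitalized = re.compile(r"(?i:[a-z])[a-z]*")
accepts_hello = capitalized.fullmatch("Hello") is not None
accepts_hello_mixed = capitalized.fullmatch("HeLLo") is not None
```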
Creates a sub-expression that does have a numbered back-reference, but may also be referenced by the name group. Group names are always case-sensitive. They may contain any characters except > and ). The text following the >, up to but not including the ) is the sub-expression to capture.
Group names do not have to be unique. In cases where they collide, both sub-expressions map to the same collector, which may be accessed by name or by number. The number assigned to all sub-expressions having the same name will be based on when the first one was encountered. When back-referenced or accessed from the replace or MatchData object, the contents of the group will be from the last time a group with that name was matched.
This form of named group capture is based on Python syntax. Unlike the .NET syntax for named groups, these groups are numbered where they occur, along with unnamed groups.
Examples:
match = R$('/a(?<els>l+)i/').match('alligator')
assert(match['els'].start == 2)
assert(match['els'].length == 2)
assert(match[1].start == 2)
assert(match[1].length == 2)
r1 = r$('/a(?P<x>\d)|b(?P<x>\w)/')
match = r1.match('a5')
assert_equal('5', match['x'].matched)
match = r1.match('bd')
assert_equal('d', match['x'].matched)
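Since this syntax comes from Python, the first example can be checked against Python's own `re` module, shown here as an analog. Python does not allow the same group name to be defined twice, so the duplicate-name example above has no Python counterpart. Offsets are 0-based in Python.

```python
import re

m = re.search(r"a(?P<els>l+)i", "alligator")

by_name = m.group("els")
by_number = m.group(1)                  # the named group is also group 1
one_based_start = m.start("els") + 1
group_length = m.end("els") - m.start("els")
```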
Back-references the text previously captured in the named group group. If that group has not yet been captured, an empty string is used.
Example:
match = r$('/(?P<quote>[''"]).*?(?P=quote)/').match('I said, "She''s a witch"')
assert_equal(9, match.start)
assert_equal(15, match.length)
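The quoting example translates directly to Python's standard `re` module, which serves as an analog of this library. The apostrophe inside the quoted text is skipped because the back-reference requires the same quote character that opened the match.

```python
import re

pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = "I said, \"She's a witch\""

m = pattern.search(text)
one_based_start = m.start() + 1   # Python offsets are 0-based
match_length = m.end() - m.start()
opening_quote = m.group("quote")
```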
Also known as a Kleene closure, this operator matches as many of the preceding expression as possible (greedy search, but see ? below), but can match none of them, depending on the constraints of the rest of the expression.
Example: R$('/A*B/') matches 'B', 'AB', 'ZAAB', and even 'AZB' (because
the final 'B' matches the "zero A's followed by B" case).
This operator matches at least one, but as many as possible (greedy search, but see ? below), of the preceding expression.
Example: R$('/A+B/') matches 'AB' and 'ZAAB', but not 'B' or 'AZB'.
This metacharacter matches any character (except sometimes newline). It's often used to skip over text whose content doesn't matter. But be careful of searches that are greedier than you intended.
By default, the period does not match a newline character (linefeed, char(10)). You can force period to match newline by enabling the DotMatchesNewline option.
Example: R$('/A.B/') matches 'AAB' and 'ACB', but not 'AB'.
R$('/.*B/') matches 'ABRACADABRA' at position 1, but the length of
the match is 9 (inclusive of the second 'B').
This operator matches one or none of the preceding expression. It prefers one by default (greedy), but if immediately followed by another ? (see below) it prefers none.
Example: R$('/A?B/') matches 'B', 'AB', 'ZAAB' (at position 3) and
even 'AZB' (also at position 3).
If a ? immediately follows a *, +, or ? operator, then the match is non-greedy, meaning that it prefers fewer repetitions rather than greater.
Example: R$('/A*?/') matches 'AAA' at position 1, with a length of zero,
whereas without the ? it would match to a length of 3.
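Greedy and non-greedy matching can be compared side by side in Python's standard `re` module, used here as an analog of this library:

```python
import re

greedy = re.match(r"A*", "AAA").group()     # takes as many A's as possible
lazy = re.match(r"A*?", "AAA").group()      # prefers zero repetitions

# The same contrast with the period: up to the last 'B' versus the first 'B'.
greedy_dot = re.search(r".*B", "ABRACADABRA").group()
lazy_dot = re.search(r".*?B", "ABRACADABRA").group()
```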
The characters between matching square brackets form a specification of
a character class. In its simplest form, it's just a list of the characters
that qualify. For example: R$('/[xyz]/') matches "x", "y" or "z".
In this form, it's synonymous with R$('/x|y|z/').
If the first character within the brackets is "^", then the sense of the
character class is reversed. That is, anything except these characters.
For example, R$('/[^abc]/') will not match any of the first three
lower-case letters of the alphabet.
The back-slash can be used to introduce special characters into the sequence,
as detailed below. Thus, R$('/[\d.]/') matches any numeric digit
or a period. Note that the period (along with most special characters) has
a literal meaning when used within a character class.
A dash ('-') can be used to specify a range of characters. For example,
R$('/[A-Z]/') matches all uppercase letters.
If a dash is immediately followed by a [, then it is taken to mean
character class subtraction, and the [ is expected to have a matching ]
to indicate the end of the class to subtract from the main class. For instance,
to match all consonants, you could use R$('/[a-z-[aieou]]/i'). Note, therefore,
that to end a range with [ you would need to escape it with a \. You may
subtract more than one subclass, and you may nest subtractions. For example,
R$('/[a-z-[k-o-[lm]]]/') is functionally equivalent to R$('/[a-z-[kno]]/'), which could also
be expressed as R$('/[a-jlmp-z]/'). Subtraction is more useful for readability than
performance: R$('/[\d-[5]]/') says "all digits except 5" better than R$('/[0-46-9]/') does.
POSIX character classes can be enclosed within another level of brackets and colons.
For instance, R$('/[[:punct:]]/') matches a punctuation character. See the list
of supported character classes.
Once introduced, a character class becomes an expression like any
other character. Thus, it can be repeated with '*', '+', or '?'. For
example, R$('/[A-Za-z]*/') matches any number of letters.
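Python's standard `re` module has no character class subtraction, so the sketch below swaps in a different technique, a negative look-ahead in front of the class, to show what the subtraction examples select. It is an emulation for comparison, not how this library implements subtraction.

```python
import re

# Emulates [a-z-[aeiou]] with the 'i' option: a letter that is not a vowel.
consonant = re.compile(r"(?![aeiou])[a-z]", re.IGNORECASE)
consonants = "".join(consonant.findall("Regular Expression"))

# Emulates [\d-[5]] and compares it with the hand-written range [0-46-9].
digits = "0123456789"
by_exclusion = re.findall(r"(?!5)\d", digits)
by_ranges = re.findall(r"[0-46-9]", digits)
```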
The back-slash can be used to insert the character that follows, without special interpretation. However, there are some characters that have special meaning when following a back-slash, depending on context.
| Legend |
|---|
| Only available within a character class |
| Only available when not within a character class |
| Available in both contexts |
| Escape | Meaning |
|---|---|
| \1..\9 | back-reference to a previously captured group |
| \0..\377 | the character corresponding to the specified octal value |
| \a | bell (\x07) |
| \A | anchor to beginning of text (like ^, but does not match the beginning of multiple lines when Multiline is enabled) |
| \b | backspace (\x08) |
| \b | a word boundary |
| \B | a word non-boundary |
| \cA..\cZ \ca..\cz | the corresponding control character (\x01..\x1A) |
| \d | a numeric digit ('0'..'9') |
| \D | anything except a numeric digit. |
| \e | escape (\x1B) |
| \f | form feed (\x0C) |
| \G | end of last match* |
| \k<group> \k'group' | named group back-reference |
| \m | beginning of word |
| \M | end of word |
| \n | new line (\x0A) |
| \p{Classname} | POSIX character class classname |
| \Q | escapes all subsequent text as literal until \E |
| \r | return (\x0D) |
| \s | white space [ \t\n\v\f\r] |
| \S | anything except white space |
| \t | tab (\x09) |
| \v | vertical tab (\x0B) |
| \w | word character [A-Za-z0-9_] |
| \W | anything except a word character |
| \y | a word boundary (like \b) |
| \Y | a word non-boundary (like \B) |
| \z | anchor to end of text (like $, but does not match the end of multiple lines when Multiline is enabled) |
| \Z | like \z, but if the text ends with a linefeed, matches the position of the final linefeed |
| \x00..\xFF | the character corresponding to the specified hexadecimal value. |
| \` | anchor to beginning of text (like \A) |
| \' | anchor to end of text (like \z) |
| \< | beginning of word (like \m) |
| \> | end of word (like \M) |
* The operation of \G differs between various regex engines, so it bears explaining here. \G represents an anchor to the end of the last match, meaning that what follows \G must match what follows that position. The actual position used is maintained in the public member ContinueFrom, so that implies that it is Regex-specific, but not string-specific. If you reuse a Regex that contains \G with another target string, you may want to clear the ContinueFrom member first. Conversely, if you want to use another Regex with the same string and have it continue from the match of the other Regex, you must first copy the ContinueFrom value from the first Regex to the second one. When a match fails, ContinueFrom is set to 0, unless the Continue option is true.
The match must begin with the first character of a line of text in order to be accepted. If Multiline is true, then an embedded newline character qualifies as marking the beginning of line, but the beginning of the string always qualifies.
Example: R$('/^front/') matches 'front-end', but not 'affront'.
This operator optionally specifies the minimum and maximum number of repetitions for the term that it follows. It can take any of the following forms:
R$('/(\d{1,3}\.){3}\d{1,3}/') matches an IPv4 address (though it doesn't verify that each number is 255 or less).
This expression reads "1 to 3 digits followed by a dot, repeated 3 times, followed by 1 to 3 digits."
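The IPv4 pattern works unchanged in Python's standard `re` module, which is used here as an analog of this library. The sketch confirms both the repetition counts and the caveat about values above 255.

```python
import re

ipv4 = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

valid = ipv4.fullmatch("192.168.0.1") is not None
out_of_range = ipv4.fullmatch("999.1.1.1") is not None   # accepted: no range check
too_short = ipv4.fullmatch("1.2.3") is not None          # only two dotted groups
```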
This operator occurs between two expressions to specify that either one or the other is required. It has the lowest operator priority of any operator, so to prevent everything on one side or the other from being lumped together, use parentheses.
Example: R$('/c(a|u)t/') matches 'cat' and 'cut' at position 1.
But R$('/ca|ut/') matches 'cut' at position 2 ("ut").
Note: when used in a conditional group, this operator can be taken as an "else" instead.
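The precedence difference can be seen in Python's standard `re` module, used as an analog of this library. Python offsets are 0-based, so 1 is added to compare with the positions quoted above.

```python
import re

grouped = re.search(r"c(a|u)t", "cut")
grouped_position = grouped.start() + 1

# Without parentheses the alternatives are 'ca' and 'ut', not 'a' and 'u'.
ungrouped = re.search(r"ca|ut", "cut")
ungrouped_position = ungrouped.start() + 1
ungrouped_text = ungrouped.group()
```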
This parser supports the following POSIX character class names, which are not case-sensitive:
| Class Name | Description | Characters included |
|---|---|---|
| alnum | Alphanumeric characters | a-zA-Z0-9 |
| alpha | Alphabetic characters | a-zA-Z |
| ascii | ASCII (7-bit) | \x00-\x7F |
| blank | Space and tab | \x20\t |
| cntrl | Control characters and DEL | \x00-\x1F\x7F |
| digit | Digits | 0-9 |
| graph | Visible characters (characters with graphemes) | \x21-\x7E |
| lower | Lowercase letters | a-z |
| print | Printable characters | \x20-\x7E |
| punct | Punctuation | !"#$&'()*+,\-./:;<=>?@[\]^_`{|}~ |
| space | Whitespace | \x20\t\r\n\v\f |
| upper | Uppercase letters | A-Z |
| word | Word characters | A-Za-z0-9_ |
| xdigit | Hexadecimal digits | 0-9A-Fa-f |
You may include a POSIX character class within a character class by enclosing it within "[:" and ":]", or by using the Java syntax \p{Classname}. The latter syntax may also be used outside a character class. Neither of these flavors is treated as case-sensitive by this implementation, even though both are in other parsers (with the Java flavor capitalizing some letters).
I don't have any items left on my to-do list. If you can think of any features you'd like to see added, please contact me.
I hate to say never, but the following features do not match my pattern (pun intended) for Regex. If you can make a good argument for including any of these, please let me know.