Class Regex

Introduction

This class provides Regular Expression matching and substitution for Synergy/DE. It requires Synergy/DE version 9.1.5 or higher.

This page documents Regex version 1.0.
Author: Chip Camden

Contents

  1. Introduction
  2. Contents
  3. Explanation of symbols used
  4. Member reference
    1. from - create a Regex
    2. GlobalSearch - global search option
    3. IgnoreCase - ignore case on characters
    4. LastMatch - result of the last match performed
    5. match - match a string
    6. Multiline - multi-line search option
    7. replace - perform substitution
    8. ToString - a string representation of the Regex
  5. Supporting classes
    1. MatchData - result of calling Regex's match method
    2. RegexException - thrown on a syntax error
  6. Syntax for Regular Expressions
  7. Features not yet implemented

Explanation of symbols used

Words in italics indicate an instance of a class. The word corresponds to the class name, except where more than one instance is represented in the same statement. In that case a number (2, 3, etc.) is appended to the class name.

Words in normal typeface are to be taken literally (required punctuation, class name in a static reference, method name, etc.)

The symbol => is used to separate an expression (on the left) from its return value (on the right).

An ellipsis (...) indicates that the previous argument may be repeated any number of times. The description will indicate whether one instance is required.

Member reference

static method from, macro R$, constructor

Regex.from(object) => regex
R$(object) => regex
new Regex(object) => regex

Creates a new Regex from the string representation of object, which may be an alpha expression or any class of object. The resulting string will be compiled immediately, which may throw a RegexException if the syntax is incorrect.

If you .include "chipstips.def", then you may use R$ as a syntactic shortcut for Regex.from.

GlobalSearch

Regex.GlobalSearch => boolean
Regex.GlobalSearch = boolean
GlobalSearch controls whether the replace method will continue to replace text until a match is no longer found. It can be set to true initially with the "g" option in the expression, but you may alter it afterwards here.

IgnoreCase

Regex.IgnoreCase => boolean
Regex.IgnoreCase = boolean
IgnoreCase controls case-sensitivity on matches. The "i" option in the expression can be used to set this value to true initially, but you may change it afterwards by altering this member.

LastMatch

Regex.LastMatch => MatchData
LastMatch contains the MatchData object returned by the last call to match. Note that replace also calls match, and when the GlobalSearch option is enabled it keeps calling match until no further match is found. After a global replace, LastMatch therefore holds an unsuccessful result.

method match

Regex.match(a) => MatchData
Regex.match(a, int) => MatchData
Matches the regular expression against a, returning a MatchData object describing the match (or lack thereof). If no match occurred, MatchData's start member is 0. In the second form above, int specifies the beginning position within a for consideration of matches -- but in any case, MatchData's members (start, end, before, etc.) refer to the entire string.
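The position conventions described above can be mimicked with a short Python sketch. It uses Python's re module as a stand-in, not the Synergy/DE class. The helper name match_info and the returned dictionary are hypothetical, and two points are assumptions: that int is a 1-based position, and that end and length are 0 when no match occurs (the text only states that start is 0).

```python
import re

def match_info(pattern, text, begin=1):
    """Search text starting at 1-based position begin (hypothetical helper).

    Positions in the result are 1-based and relative to the whole string,
    and start == 0 signals that no match occurred.
    """
    m = re.compile(pattern).search(text, begin - 1)
    if m is None:
        # Assumption: end and length are also 0 on failure.
        return {"start": 0, "end": 0, "length": 0, "matched": ""}
    return {
        "start": m.start() + 1,         # 0-based -> 1-based
        "end": m.end(),                 # 1-based index of the last matched character
        "length": m.end() - m.start(),
        "matched": m.group(0),
    }

first = match_info("b", "abcabc")       # search from the beginning
later = match_info("b", "abcabc", 3)    # search from position 3 onward
none = match_info("z", "abcabc")        # no match
```

Here first reports start 2, while later reports start 5: the starting position limits where the search begins, but the reported positions still refer to the entire string.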

Multiline

Regex.Multiline => boolean
Regex.Multiline = boolean

Multiline controls how match (and consequently, replace as well) treat a newline character (char(10)) embedded in the string being searched. It may be initially set to true via the "m" option on the expression, or you may alter it here.

When Multiline is true, a newline character is treated as an end of line and a beginning of line, so it matches '^' and '$', but cannot be matched by any other pattern (even '.' or '\x0A'). When false, newline is treated just like any other character.
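As a rough analogy only, the two settings can be approximated with Python's re module, which is not the Synergy/DE class. The helper flags_for is hypothetical. Multiline true is approximated by re.MULTILINE, and Multiline false by re.DOTALL so that '.' treats newline like any other character. One difference is not reproduced: in Python, '\x0A' still matches a newline under re.MULTILINE.

```python
import re

def flags_for(multiline):
    """Map the Multiline option onto Python flags (hypothetical helper)."""
    # True:  '^' and '$' match at embedded newlines; '.' does not match newline.
    # False: newline is an ordinary character, so '.' must match it.
    return re.MULTILINE if multiline else re.DOTALL

text = "first line\nsecond line"

# '^' only recognizes the embedded line start when Multiline is on.
off = re.search(r"^second", text, flags_for(False))
on = re.search(r"^second", text, flags_for(True))

# '.' only consumes the newline when Multiline is off.
dot_off = re.search(r"line.second", text, flags_for(False))
dot_on = re.search(r"line.second", text, flags_for(True))
```

In this sketch off and dot_on are None, while on matches at position 12 (1-based) and dot_off matches across the newline.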

method replace

Regex.replace(a, a2) => string
Regex.replace(a, a2, int) => string

Replaces the substring of a that matches the regular expression with a2. If the GlobalSearch option is enabled (the 'g' option on the expression), then all matches will be replaced -- otherwise only the first one will be.

In the second form, int specifies the beginning position within a at which substitutions may occur.

If a2 contains any '\' characters, the following special substitutions will be performed:

  • \1 through \9 - the text matching the corresponding parenthesized sub-expression.
  • \& - the text matching the entire expression.
  • \` - the text before the match.
  • \' - the text after the match.
  • \+ - the text matching the highest-numbered matched sub-expression.
  • \ followed by any other character - that character.
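The substitutions listed above could be expanded along the following lines. This is a sketch in Python against a Python re match object, not the class's actual implementation. The function name expand_replacement is hypothetical. Two behaviors are assumptions: an unmatched sub-expression contributes "", and a lone trailing '\' is kept literally.

```python
import re

def expand_replacement(m, template):
    """Expand \\1..\\9, \\&, \\`, \\' and \\+ in template using match m."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch != "\\" or i + 1 == len(template):
            out.append(ch)              # ordinary character (or a lone final '\')
            i += 1
            continue
        nxt = template[i + 1]
        if nxt in "123456789":
            n = int(nxt)
            # Assumption: a missing or unmatched sub-expression yields "".
            out.append((m.group(n) or "") if n <= m.re.groups else "")
        elif nxt == "&":
            out.append(m.group(0))              # the entire match
        elif nxt == "`":
            out.append(m.string[:m.start()])    # text before the match
        elif nxt == "'":
            out.append(m.string[m.end():])      # text after the match
        elif nxt == "+":
            # Highest-numbered sub-expression that actually matched.
            out.append(next((g for g in reversed(m.groups()) if g is not None), ""))
        else:
            out.append(nxt)                     # any other escaped character
        i += 2
    return "".join(out)

m = re.search(r"(\w+)@(\w+)", "mail bob@example now")
swapped = expand_replacement(m, r"\2 at \1")
whole = expand_replacement(m, r"[\&]")
around = expand_replacement(m, r"\`|\'")
highest = expand_replacement(m, r"\+")
```

With this input, swapped is "example at bob", whole is "[bob@example]", around is "mail | now", and highest is "example".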

override method ToString, explicit alpha cast

Regex.ToString() => string
(a)Regex => string
Returns the original string used to construct the Regex, regardless of any changes to options after construction.

Supporting classes

Class MatchData

An instance of this class is returned by the match method, and is also available in the public member LastMatch. It describes a regular expression match, or lack thereof.

Public members:

  • [int] => MatchData - (Indexer) subexpression matches ([0] = this, [1] = \1, etc.). If no match exists for a selected sub-expression, the returned MatchData object will have start and length set to 0 and matched will return "".
  • after => string - the portion of source following the matching substring, or "" if nothing follows or no match occurred.
  • before => string - the portion of source before the matching substring, or "" if nothing preceded or no match occurred.
  • Count => int - the number of elements accessible through the Indexer. I.e., the highest matched sub-expression plus one.
  • end => int - the index of the last character of the match within source.
  • length => int - the length of the matching substring. Zero if no match occurred, or the empty string matched.
  • matched => string - the matching portion of source, or "".
  • replace(string) => string - replace \1..\9, etc. with sub-matches. This is called automatically by Regex's replace method.
  • source => string - the entire original string searched.
  • start => int - the starting index of the match within source, or 0 if no match occurred.
  • ToString() => string - a string representation of the match information.

Class RegexException extends Synergex.SynergyDE.SynException

This exception will be thrown upon construction of a Regex if the expression contains any syntax errors. Its Message member contains the message "Error parsing regular expression: " followed by the text of the expression (or a portion thereof), followed by a more detailed description of the problem:
  • "Empty expression not allowed" - only an initial delimiter was supplied.
  • "Unknown option" - one of the options was not g, i, or m.
  • "Missing operand" - an operator didn't have an operand (e.g. '/*/')
  • "Final \ encountered" - the expression ended with '\'
  • "\c expects a capital letter" - the character after '\c' was something else.
  • "\x expects two hex digits" - we didn't get them.
  • "Empty [] not allowed" - an empty character class was encountered.
  • "Missing ]" - a character class wasn't terminated.
  • "Character range requires two characters" - [-], [A-], or [-B].
  • "Invalid character range" - [\s-\d] or [Z-A], for instance.
  • "Unmatched )" - at least one too many close parentheses.

Regular Expression syntax

The string from which a regular expression may be constructed must have the general form:

<delimiter><match><delimiter><options>

where:

<delimiter> is any single character except '\'. Both delimiters must match.
<match> is the regular expression code specifying the match.
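<options> is zero or more of the following single-character search options:

  • g - global search; sets GlobalSearch to true.
  • i - ignore case; sets IgnoreCase to true.
  • m - multi-line; sets Multiline to true.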

Most characters contained within <match> must follow one another immediately in the target string in order to match. For instance, R$('/xyz/') matches 'abcdefghijklmnopqrstuvwxyz' at position 24, but it does not match 'xzy' or 'x y z'.
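The general form can be split mechanically. The following Python sketch is not the class's parser, and the function name split_expression is hypothetical. It assumes that a character preceded by '\' never terminates the match portion, that a missing closing delimiter simply means there are no options, and that the valid options are g, i, and m.

```python
def split_expression(expr):
    """Split '<delimiter><match><delimiter><options>' into (match, options)."""
    if not expr or expr[0] == "\\":
        raise ValueError("delimiter must be a single character other than '\\'")
    delim = expr[0]
    i = 1
    while i < len(expr):
        if expr[i] == "\\":
            i += 2                  # skip the escaped character, delimiter included
            continue
        if expr[i] == delim:
            break                   # unescaped closing delimiter found
        i += 1
    # Assumption: with no closing delimiter, everything after the first
    # character is the match and the option string is empty.
    match, options = expr[1:i], expr[i + 1:]
    for opt in options:
        if opt not in "gim":
            raise ValueError("Unknown option: " + opt)
    return match, options

plain = split_expression("/xyz/")
escaped = split_expression("/a\\/b/gi")   # the expression /a\/b/gi
other = split_expression("#a/b#m")        # '#' used as the delimiter
```

Choosing a delimiter that does not occur in the match, as in the last call, avoids the need to escape it.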

Some characters, however, have special meaning within a regular expression. If you want to include those characters literally, you must escape them with a preceding '\'. You can also include your delimiter in the same way -- e.g., R$('/\//') matches a '/'. Following is a list of all special characters supported by this implementation, and their meanings.

$ - anchor to end of line

The match must end at the last character of a line of text in order to be accepted. If Multiline is true, then an embedded newline character qualifies as marking the end of a line, but the end of the string always qualifies.

Example: R$('/end$/') matches 'friend', but not 'friends'.

( ) - group sub-expressions

Parentheses can be used to override the usual operator precedence by grouping operations together. Additionally, parenthesized sub-expressions are counted from left to right (by their open parenthesis) to number them from 1 to the number of sub-expressions encountered. The text that matched each sub-expression can be accessed from the MatchData object returned from the method match by indexing it. For example:

match = R$('/.*(c)/').match("abcd")
assert(match[1].start == 3)
assert(match[1].length == 1)

Additionally, in the replace method, the replacement string may contain escaped references to these sub-expressions.

If a sub-expression is completed more than once within a match, the last one wins. For overlapping sub-expression matches, the longest one wins.
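Python's re module numbers groups the same way and also keeps the last completion of a repeated group, so it can serve as a rough illustration. This is an analogy, not the Synergy/DE class, and Python positions are converted to 1-based here to match the conventions of this page.

```python
import re

# Sub-expression 1 of /.*(c)/ against "abcd".
m = re.search(r".*(c)", "abcd")
start = m.start(1) + 1            # 1-based start of sub-expression 1
length = m.end(1) - m.start(1)

# The group is completed twice ('a', then 'b'); the last completion is kept.
rep = re.search(r"(a|b)+", "xab")
last = rep.group(1)
```

Here start is 3 and length is 1, and last is "b" rather than "a".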

* - zero or more of the preceding expression

Also known as a Kleene closure, this operator matches as many of the preceding expression as possible (greedy search), but can match none of them, depending on the constraints of the rest of the expression.

Example: R$('/A*B/') matches 'B', 'AB', 'ZAAB', and even 'AZB' (because the final 'B' matches the "zero A's followed by B" case).

+ - one or more of the preceding expression

This operator matches at least one, but as many as possible (greedy search), of the preceding expression.

Example: R$('/A+B/') matches 'AB' and 'ZAAB', but not 'B' or 'AZB'.

. - any single character

This metacharacter matches any character. It's often used to skip over text whose content doesn't matter. But be careful of searches that are greedier than you intended.

Example: R$('/A.B/') matches 'AAB' and 'ACB', but not 'AB'. R$('/.*B/') matches 'ABRACADABRA' at position 1, but the length of the match is 9 (inclusive of the second 'B').

? - zero or one of the preceding expression

This operator matches one or none of the preceding expression.

Example: R$('/A?B/') matches 'B', 'AB', 'ZAAB' (at position 3) and even 'AZB' (also at position 3).
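The examples given for '*', '+', '.', and '?' can be checked with a small Python sketch. Python's re module is used here as a stand-in for these basic operators and is not the Synergy/DE class. The helper where is hypothetical and reports a 1-based position plus length, with (0, 0) meaning no match.

```python
import re

def where(pattern, text):
    """Return (1-based start, length) of the first match, or (0, 0) if none."""
    m = re.search(pattern, text)
    return (m.start() + 1, m.end() - m.start()) if m else (0, 0)

star = [where("A*B", s) for s in ("B", "AB", "ZAAB", "AZB")]
plus = [where("A+B", s) for s in ("AB", "ZAAB", "B", "AZB")]
dot = [where("A.B", s) for s in ("AAB", "ACB", "AB")]
greedy = where(".*B", "ABRACADABRA")    # '.*' runs up to the last 'B'
optional = [where("A?B", s) for s in ("B", "AB", "ZAAB", "AZB")]
```

The greedy case yields (1, 9), and both 'ZAAB' and 'AZB' yield position 3 for A?B, in agreement with the examples above.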

[ - introduce a character class

From this character up to the matching ']' forms a specification of a character class. In its simplest form, it's just a list of the characters that qualify. For example: R$('/[xyz]/') matches "x", "y" or "z". In this form, it's synonymous with R$('/x|y|z/').

If the first character within the brackets is "^", then the sense of the character class is reversed. That is, anything except these characters. For example, R$('/[^abc]/') will not match any of the first three lower-case letters of the alphabet.

The back-slash can be used to introduce special characters into the sequence, as detailed below. Thus, R$('/[\d.]/') matches any numeric digit or a period. Note that the period (along with most special characters) has a literal meaning when used within a character class.

A dash ('-') can be used to specify a range of characters. For example, R$('/[A-Z]/') matches any single uppercase letter.

Once introduced, a character class becomes an expression like any other character. Thus, it can be repeated with '*', '+', or '?'. For example, '/[A-Za-z]*/' matches any number of letters.
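The character class forms described above behave the same way in Python's re module, which is used here purely as an illustration and is not the Synergy/DE class. Positions are converted to 1-based to match this page.

```python
import re

simple = re.search(r"[xyz]", "say").group(0)         # first of x, y or z
negated = re.search(r"[^abc]", "abcd").start() + 1   # first character not in a, b, c
mixed = re.findall(r"[\d.]", "v1.5")                 # digits or a literal period
ranged = re.search(r"[A-Z]", "abcDe").group(0)       # first uppercase letter
repeated = re.match(r"[A-Za-z]*", "abc123").group(0) # a run of letters
empty = re.match(r"[A-Za-z]*", "123").group(0)       # '*' also accepts zero letters
```

Note the last case: because '*' allows zero repetitions, the expression still matches a string that begins with a digit, with an empty match.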

\ - escape a special character

The back-slash can be used to insert the character that follows, without special interpretation. However, there are some other characters that have special meaning when following a back-slash:

^ - anchor to beginning of line

The match must begin with the first character of a line of text in order to be accepted. If Multiline is true, then an embedded newline character qualifies as marking the beginning of line, but the beginning of the string always qualifies.

Example: R$('/^front/') matches 'front-end', but not 'affront'.

| - union (or)

This operator occurs between two expressions to specify that either one or the other is required. It has the lowest precedence of any operator, so to prevent everything on one side or the other from being lumped together, use parentheses.

Example: R$('/c(a|u)t/') matches 'cat' and 'cut' at position 1. But R$('/ca|ut/') matches 'cut' at position 2 ("ut").
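The effect of precedence in this example can be reproduced with Python's re module, used here as a stand-in rather than the Synergy/DE class. The helper where is hypothetical and returns a 1-based position with the matched text.

```python
import re

def where(pattern, text):
    """Return (1-based start, matched text), or (0, "") if there is no match."""
    m = re.search(pattern, text)
    return (m.start() + 1, m.group(0)) if m else (0, "")

grouped_cat = where("c(a|u)t", "cat")   # alternation limited to 'a' or 'u'
grouped_cut = where("c(a|u)t", "cut")
ungrouped = where("ca|ut", "cut")       # parsed as 'ca' or 'ut'
```

Without parentheses the alternatives are 'ca' and 'ut', so 'cut' matches only from position 2.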

Features not yet implemented

The following features that are present in some Regular Expression parsers have not yet been implemented here:

Syntax

Options