Regular Expression Object
A regular expression is a technique for comparing an expression against the text
being searched. The expression is a pattern of characters constructed according
to a set of rules. By matching the expression against a text string, a regular
expression can be used to search for patterns in the text, to replace strings in
the text, and to extract substrings from the text. A regular expression object is
not an ASP object but a scripting object with regular expression features. The
regular expression support of the VBScript engine is implemented as a COM object.
In general, regular expressions are provided as scripting objects with regular
expression features. In other words, a regular expression object follows regular
expression syntax for its patterns and the syntax of the scripting language for
everything else. The features of a regular expression object therefore include
both the regular expression syntax and the objects that expose it, in addition
to the syntax of the scripting language.
Regular Expression Features
The features of regular expressions include the expression pattern, the
character set, and the ordinary characters, special characters, and
metacharacters used in regular expressions.
Expression
An expression of a regular expression is a symbolic character pattern. In a
general sense, a regular expression is an arithmetic-like, symbolic, single-line
program bounded by a pair of delimiters.
In JScript, a pair of forward slash (/) characters is used as the
delimiters.
/expression/
In VBScript, a pair of quotation mark (") characters is used as the delimiters.
"expression"
Basically, an expression describes the string to be matched against the searched
text. An expression is therefore a matching template composed of ordinary
characters and special characters that together describe a character pattern.
Ordinary characters are literal characters: either members of a character set
bounded by a pair of square brackets, or matching characters placed outside the
square brackets but inside the pair of delimiters. An ordinary character always
carries the meaning of the letter itself.
Special characters are also metacharacters. A special character carries a
special meaning and usually does not represent the literal letter itself.
In a general sense, a metacharacter is a character, or a sequence of characters,
that carries a special meaning in a computing application. A seldom-used
character serves as the indicator, which makes programming easier.
The Character Set
Although regular expressions use only one set of characters, a character may
have a different meaning depending on whether it is used in the regular
expression itself, between a pair of square brackets, or in a replacement
pattern.
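As a minimal JScript sketch of this point, the characters $ and ^ are used below in all three contexts. The sample text and the "USD" replacement are hypothetical.

```javascript
var text = "cost: 5$ or 7$"; // hypothetical sample text

// In the expression itself, ^ and $ are special: start and end of the string.
var startsWithCost = /^cost/.test(text);
var endsWithDollar = /\$$/.test(text); // escaped literal $ followed by special $

// Between square brackets, $ is ordinary, and a leading ^ negates the set.
var dollarCount = text.match(/[$]/g).length;
var digitsOnly = text.replace(/[^0-9]/g, "");

// In a replacement pattern, $1 stands for the first saved match.
var rewritten = text.replace(/(\d)\$/g, "$1 USD");
```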
The character set for constructing a regular expression is
- English Alphabet Capital Letters: A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z
- English Alphabet Small Letters: a b c d e f g h i j k l m n o p
q r s t u v w x y z
- Arabic numerals: 0 1 2 3 4 5 6 7 8 9
- Special Symbols of ASCII Symbols: (space) =(equals) +(plus) -(hyphen-minus)
*(asterisk) /(solidus) \(reverse solidus) ^(circumflex accent) ((left
parenthesis) )(right parenthesis) &(ampersand) .(full stop) :(colon)
<(less-than sign) >(greater-than sign) "(quotation mark) '(apostrophe) _(low
line) [(left square bracket) ](right square bracket) |(vertical bar)
Ordinary Characters
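An ordinary character is a character that represents the letter itself. Several
symbols in the list below are also special characters; they count as ordinary
only when escaped with a backslash or, in most cases, when placed between square
brackets.
- English Alphabet Capital Letters: A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z
- English Alphabet Small Letters: a b c d e f g h i j k l m n o p
q r s t u v w x y z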
- Arabic numerals: 0 1 2 3 4 5 6 7 8 9
- Special Symbols of ASCII Symbols: (space) =(equals) +(plus) -(hyphen-minus)
*(asterisk) /(solidus) \(reverse solidus) ^(circumflex accent) ((left
parenthesis) )(right parenthesis) &(ampersand) .(full stop) :(colon)
<(less-than sign) >(greater-than sign) "(quotation mark) '(apostrophe) _(low
line) [(left square bracket) ](right square bracket)
Special Characters
A special character is a metacharacter that represents a special meaning instead
of the literal letter itself.
- *: to match the previous character or subexpression zero or more times. Equivalent to {0,}.
- +: to match the previous character or subexpression one or more times.
Equivalent to {1,}.
- ?: to match the previous character or subexpression zero or one time.
Equivalent to {0,1}. When it immediately follows another quantifier (*, +, ?,
{n}, {n,}, or {n,m}), it makes the matching pattern non-greedy.
- ^: to match the start position of the searched string, and also the position
following \n or \r when the Multiline property is set. When used as the first
character in a bracket expression, it negates the character set.
- $: to match the end position of the searched string, and also the position
before \n or \r when the Multiline property is set.
- .: to match any single character except the newline character (\n).
- [: to mark the start of a bracket expression.
- ]: to mark the end of a bracket expression.
- {: to mark the start of a quantifier expression.
- }: to mark the end of a quantifier expression.
- (: to mark the start of a subexpression.
- ): to mark the end of a subexpression.
- |: to indicate a choice between two or more items.
- /: to denote the start and the end of a literal regular expression pattern in
JScript. Single-character flags can be added after the closing / to specify
search behavior.
- ": to denote the start and the end of a regular expression pattern string in
VBScript.
- \: to mark the next character as a special character, a literal, a
backreference, or an octal escape.
- -: to match a range of characters between the characters before and after the
hyphen inside square brackets.
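The behavior of several of the special characters above can be sketched in JScript as follows. The sample strings are hypothetical, and the m flag is assumed here to play the role of the Multiline property.

```javascript
var text = "aaa <b>one</b> <b>two</b>"; // hypothetical sample text

// Quantifiers applied to the previous character.
var star = /a*/.exec(text)[0]; // zero or more
var plus = /a+/.exec(text)[0]; // one or more
var opt  = /a?/.exec(text)[0]; // zero or one

// A ? placed after another quantifier makes the match non-greedy.
var greedy = /<b>.*<\/b>/.exec(text)[0];
var lazy   = /<b>.*?<\/b>/.exec(text)[0];

// Anchors, with and without multiline matching.
var startsWithA = /^a/.test(text);
var endsWithTag = /b>$/.test(text);
var lineStart   = /^two/m.test("one\ntwo");
var stringStart = /^two/.test("one\ntwo");

// Alternation, and the dot refusing to match a newline.
var firstChoice = /one|two/.exec(text)[0];
var dotNewline  = /a.b/.test("a\nb");
```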
MetaCharacters
Besides the special characters, a metacharacter can also be a sequence of
characters, that is, an escaped character or a group of characters.
Escaped Characters
Escaped characters are specific matching characters written with a preceding
backslash (\) character. An escaped character may represent an ordinary
character, a nonprinting character, or a metacharacter.
Ordinary Characters
An escaped character can represent an ordinary character as a matching or
literal character. The escape character (a single backslash \) indicates that
the following special character is not an operator.
- \*: to match or represent a letter *
- \+: to match or represent a letter +
- \?: to match or represent a letter ?
- \^: to match or represent a letter ^
- \$: to match or represent a letter $
- \.: to match or represent a letter .
- \[: to match or represent a letter [
- \]: to match or represent a letter ]. \ is usually not necessary
- \{: to match or represent a letter {
- \}: to match or represent a letter }. \ is usually not necessary
- \(: to match or represent a letter (
- \): to match or represent a letter ). \ is usually not necessary
- \|: to match or represent a letter |
- \/: to match or represent a letter /
- \\: to match or represent a letter \
- \-: to match or represent a letter -. \ is usually not necessary when - is
placed outside the square brackets, or is not placed between alphanumeric
characters inside the square brackets
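The effect of escaping can be sketched in JScript as below. The helper escapeForPattern is hypothetical (it is not part of any scripting engine); it simply places a backslash before each special character listed above.

```javascript
// Unescaped, + is a quantifier, so /1+1/ looks for "11", not "1+1".
var unescapedHit = /1+1/.test("1+1=2");
var escapedHit   = /1\+1/.test("1+1=2");

// Hypothetical helper: prefix every special character with a backslash
// so that arbitrary text is matched literally.
function escapeForPattern(s) {
  return s.replace(/[*+?^$.\[\]{}()|\/\\-]/g, "\\$&");
}

var escapedText = escapeForPattern("1+1=2 (ok)");
var literalHit  = new RegExp(escapedText).test("1+1=2 (ok)");

// Without escaping, the dot would match any character; escaped, it does not.
var dotIsLiteral = new RegExp(escapeForPattern("a.b")).test("axb");
```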
Nonprinting Characters
An escaped character can represent a nonprinting character as a matching or
literal string. The escape character (a single backslash \) indicates that the
following character, if it is not a special character or part of a character set
[...], may be a defined nonprinting character.
- \f: to match or represent a form-feed character. Equivalent to \x0C and \cL
- \n: to match or represent a newline character. Equivalent to \x0A and \cJ
- \r: to match or represent a carriage return character. Equivalent to \x0D
and \cM
- \t: to match or represent a tab character. Equivalent to \x09 and \cI
- \v: to match or represent a vertical tab character. Equivalent to \x0B and \cK
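The equivalences listed above can be checked with a small JScript sketch. Each row pairs a nonprinting character with its three spellings; the table layout and the sample string are assumptions made for illustration.

```javascript
// Each entry: the character itself and the three equivalent escapes.
var groups = [
  { ch: "\f", forms: [/\f/, /\x0C/, /\cL/] },
  { ch: "\n", forms: [/\n/, /\x0A/, /\cJ/] },
  { ch: "\r", forms: [/\r/, /\x0D/, /\cM/] },
  { ch: "\t", forms: [/\t/, /\x09/, /\cI/] },
  { ch: "\v", forms: [/\v/, /\x0B/, /\cK/] }
];

// True only if every spelling matches its character.
var allEquivalent = true;
for (var i = 0; i < groups.length; i++) {
  for (var j = 0; j < groups[i].forms.length; j++) {
    if (!groups[i].forms[j].test(groups[i].ch)) {
      allEquivalent = false;
    }
  }
}

// Typical use: split a string on runs of nonprinting separators.
var fields = "a\r\nb\tc".split(/[\r\n\t]+/);
```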
MetaCharacters by escaped character
An escaped character can represent a metacharacter as a matching string.
- \b: to match the boundary of a word, that is, the position between a word
character and a non-word character such as a space.
- \B: to match a non-boundary of a word, that is, any position that is not a
word boundary, such as a position between two adjacent characters inside a word.
- \d: to match a digit character. Equivalent to [0-9]
- \D: to match a nondigit character. Equivalent to [^0-9]
- \s: to match or represent any white-space character, including space, tab, and form
feed. Equivalent to [ \f\n\r\t\v]
- \S: to match or represent any non-white space character. Equivalent to [^ \f\n\r\t\v]
- \w: to match any word (alphanumeric and underscore) character, that
is A-Z, a-z, 0-9, and underscore. Equivalent to [A-Za-z0-9_]
- \W: to match any non word (alphanumeric and underscore) character,
that is any character except A-Z, a-z, 0-9, and underscore. Equivalent to
[^A-Za-z0-9_]
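A minimal JScript sketch of these escaped metacharacters follows. The sample sentence is hypothetical.

```javascript
var text = "Order 66 shipped to dock_7 on 2024-01-15"; // hypothetical sample

// \d and \w collect runs of digits and word characters.
var digitRuns = text.match(/\d+/g);
var wordRuns  = text.match(/\w+/g);

// \W removes everything that is not a word character.
var wordCharsOnly = text.replace(/\W/g, "");

// \s splits the text on white space.
var tokens = text.split(/\s+/);

// \b requires a word boundary; \B requires the absence of one.
var shipAtWordStart = /\bship/.test(text);
var shipInsideWord  = /\Bship/.test(text);
var hipInsideWord   = /\Bhip/.test(text);
```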
MetaCharacters by escaped word
An escaped word can represent a metacharacter as a matching string.
- \cx: to match the ASCII control character specified by x. x must be in
the range of A-Z or a-z, otherwise c is assumed to be a literal "c" character,
that is a simple escaped character.
- \xn: to match the ASCII character specified by n, where n is a
hexadecimal escape value of an ASCII code with exactly two digits.
- \num: to match the saved match specified by num,
where num is a positive integer reference of a saved match.
- \n: a search identifier that matches either a backreference or an octal
escape character of an ASCII code. In general, \1 through \9 always refer to
backreferences. For a single digit n:
- If n=0, n is an octal digit of the octal escape value of an escape character.
- If n>=1 and n<=7, and \n is preceded by at least n captured subexpressions, n
is the reference number of a backreference. Otherwise, n is the octal escape
value of an escape character.
- If n=8 or n=9, n is the reference number of a backreference.
- \nm: a search identifier that matches either a backreference or an octal
escape character of an ASCII code. In general, \nm is considered a backreference
only if there is a backreference corresponding to the specified number. For two
digits n and m:
- If n=0, nm is the octal digits of the octal escape value of an escape
character. m can only be an octal digit (0-7); otherwise, m is a literal digit
and \nm falls back to \n.
- If n>=1 and n<=7, and \nm is preceded by at least nm captured subexpressions,
nm is the reference number of a backreference.
- If n>=1 and n<=7, and \nm is preceded by at least n but fewer than nm captured
subexpressions, m can only be a literal digit and \nm falls back to \n.
- If n>=1 and n<=7, and \nm is preceded by fewer than n captured subexpressions,
nm can only be the octal escape value of an escape character. m can then only be
an octal digit (0-7); otherwise, m is a literal digit and \nm falls back to \n.
- If n=8 or n=9, and \nm is preceded by at least nm captured subexpressions, nm
is the reference number of a backreference.
- If n=8 or n=9, and \nm is preceded by at least n but fewer than nm captured
subexpressions, m can only be a literal digit and \nm falls back to \n.
- \nml: a search identifier that matches either a backreference or an octal
escape character of an ASCII code. In general, unless there is a backreference
corresponding to the specified number, \nml is usually considered an octal
escape value of an escape character if n is an octal digit of 0-3, and m and l
are octal digits of 0-7.
- \un: to match the Unicode character specified by n, where n is a
hexadecimal escape value of a Unicode code with exactly four digits.
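A few of the escaped words above can be sketched in JScript as follows. The sample strings are hypothetical. The octal example relies on the pattern containing no captured subexpressions, so the digits cannot be read as a backreference; the finer fallback rules may differ between script engines.

```javascript
// Control, hexadecimal, and Unicode escapes that all name a single character.
var controlTab = /\cI/.test("\t");     // control character I is the tab
var hexA       = /\x41/.test("A");     // exactly two hexadecimal digits
var unicodeA   = /\u0041/.test("A");   // exactly four hexadecimal digits

// With no captured subexpression in the pattern, \101 is an octal escape
// (octal 101 is decimal 65, the letter A).
var octalA = /\101/.test("A");

// With a captured subexpression before it, \1 is a backreference:
// it matches the same text that the first group saved.
var doubledLetter = /(\w)\1/.exec("balloon")[0];
var repeatedWord  = /\b(\w+) \1\b/.exec("it is is fine")[0];
```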
Grouped Characters
Grouped characters can also be treated as metacharacters that represent matching
strings.
- [...]: to mark the boundary of a character set in a bracket
expression.
- [^...]: to mark the boundary of a negative character set in a
bracket expression.
- [xyz]: to match any one of the specified characters of the
character set between the pair of square brackets. e.g. x, y, or z.
- [^xyz]: to match any one character that is not among the specified characters
of the character set between the pair of square brackets with the indicator ^.
e.g. any character other than x, y, and z.
- [a-z]: to match any one of the specified range of characters inside
the pair of square brackets. e.g. a, b, c, ..., or z.
- [^a-z]: to match any one character that is not in the specified range of
characters inside the pair of square brackets with the not indicator ^ at the
start of the square bracket set. e.g. any character other than a, b, c, ..., and
z.
- {...}: to mark the boundary of a quantifier expression.
- {n}: to match the previous character or subexpression exactly n
times where n must be a nonnegative integer.
- {n,}: to match the previous character or subexpression at least n
times where n must be a nonnegative integer.
- {n,m}: to match the previous character or subexpression at least n
and at most m times where n and m must be nonnegative integers, and n<=m.
- (...): to mark the boundary of a subexpression.
- (pattern): to match the pattern in the subexpression
as one individual group and save the match.
- (?:pattern): to match the pattern in the subexpression as one individual group
only, without saving the match.
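- (?=pattern): a positive lookahead test that succeeds only at a position where
the text that follows matches the pattern in the subexpression. The group is not
saved and its characters are not consumed, so the search for the next match
starts immediately after the matched text, not after the lookahead text.
- (?!pattern): a negative lookahead test that succeeds only at a position where
the text that follows does not match the pattern in the subexpression. The group
is not saved and its characters are not consumed, so the search for the next
match starts immediately after the matched text, not after the lookahead text.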
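The grouped characters above can be sketched in JScript as follows. All sample strings are hypothetical.

```javascript
// Bracket expressions: sets, negated sets, and ranges.
var vowels    = "regular".match(/[aeiou]/g);
var noVowels  = "regular".replace(/[aeiou]/g, "");
var onlyVowel = "regular".replace(/[^aeiou]/g, "");
var letters   = "a1b2".match(/[a-z]/g);

// Quantifier expressions: exactly n, at least n, between n and m.
var exactlyFour = /^\d{4}$/.test("2024");
var atLeastTwo  = /^\d{2,}$/.test("7");
var twoToThree  = /^\d{2,3}$/.test("1234");

// (pattern) saves its match; (?:pattern) groups without saving.
var saved   = /(\d+)-(\d+)/.exec("10-20");
var unsaved = /(?:\d+)-(\d+)/.exec("10-20");

// Lookahead tests the following text without consuming it.
var ahead      = /Windows(?=95|98)/.exec("Windows98")[0];
var notAheadNo = /Windows(?!95|98)/.test("Windows98");
var notAheadOk = /Windows(?!95|98)/.test("WindowsNT");
```

The saved result holds the whole match plus two saved groups, while the unsaved result holds the whole match plus only one.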
Regular Expression Objects
Regular expression object is the generic name for the group of scripting objects
related to regular expressions. The VBScript regular expression objects include
the VBScript RegExp Object, the VBScript Matches Collection Object, and the
VBScript Match Object.
- VBScript RegExp Object provides 3 properties and 3 methods
- Properties
- Pattern: a string expression used to define the regular expression
- IgnoreCase: a boolean used to indicate whether the case of letters in a string
should be ignored or not
- Global: a boolean used to indicate whether all possible matches in a string
should be tested or only the first one
- Methods
- Test: to test the searched-string and return a boolean value indicating
whether the regular expression can be successfully matched or not.
- Replace: to return a copy of the searched-string in which the matched text is
replaced if the searched-string can be successfully matched; otherwise a copy of
the original searched-string is returned.
- Execute: to run the search against the searched-string and return a Matches
collection object that contains one Match object for each successful match; the
collection is empty if nothing can be matched.
- VBScript Matches Collection Object, which is the result returned from the
RegExp.Execute method, provides 2 read-only properties
- Count: a read-only value of the number of Match objects in the collection
- Item: a read-only Match object accessed from the Matches collection object by
its index.
- VBScript Match Object, which represents each successful match contained
within a Matches collection object, provides 3 read-only properties
- FirstIndex: a read-only value of the position in the searched-string where the
match occurred.
- Length: a read-only value of the total length of the matched string.
- Value: the default property, a read-only value of the content of the matched
string.
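The roles of these objects can be approximated in JScript, which offers the test, replace, and exec methods instead of the VBScript object members. The sketch below is only an analogy under that assumption: the record fields FirstIndex, Length, and Value are plain JScript properties named after the Match object properties, and the sample text is hypothetical.

```javascript
var text = "Mail Ann@Example.com or bob@test.com"; // hypothetical sample

// Analogue of Pattern, IgnoreCase (i flag), and Global (g flag).
var pattern = /\w+@\w+\.com/gi;

// Analogue of Test: does the expression match at all?
var found = /\w+@\w+\.com/i.test(text);

// Analogue of Replace: a copy of the text with each match replaced.
var masked = text.replace(/\w+@/g, "***@");

// Analogue of Execute: collect one record per successful match.
var matches = [];
var m;
while ((m = pattern.exec(text)) !== null) {
  matches.push({ FirstIndex: m.index, Length: m[0].length, Value: m[0] });
}

// Analogue of Count and Item.
var count = matches.length;
var first = matches[0];
```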