Using Regular Expressions

Several SiteScope monitors allow for content matching on the text returned from the monitor's request or action (see the documentation for each monitor for applicability). This adds an important level of functionality to SiteScope. SiteScope makes use of regular expressions to match text content.

Regular expressions are a text parsing tool that was developed for use with scripting languages such as Awk and Perl and that is also available in programming environments such as Emacs, Visual C++, and Java. Regular expressions themselves are not a programming language. They do, however, make use of many special combinations of characters and symbols, which often makes them more difficult to interpret than some programming languages. The many different combinations of these special characters, known as metacharacters, make regular expressions a very powerful and flexible tool for parsing and isolating specific text within a larger body of text.

Including a regular expression in the Match Content text box of a monitor instructs SiteScope to parse the text returned to the monitor when it runs and look for content that satisfies the pattern defined by the regular expression. This document presents an overview of the syntax and metacharacters used in regular expressions for matching content in SiteScope monitors.

Defining a Regular Expression
Matching String Literals
- Using Alternation
Matching Patterns with Metacharacters
- Defining Character Classes
- Using Quantifiers
Search Mode Modifiers
Retaining Content Match Values
SiteScope Date Variables
Some Pitfalls in Working with Regular Expressions

Defining a Regular Expression

The first step in building a match content expression in SiteScope is the use of the forward slash: /. Entries in the Match Content text box of a SiteScope monitor must start and end with a forward slash / to be recognized as regular expressions. For example, entering the expression /website/ into the Match Content box of a monitor instructs SiteScope to search through the text content received by the monitor to find the text string: website. If a match is not found, the monitor reports an error status. When a match is found, the monitor reports a good status as long as all other monitor conditions are also met. Text or other characters entered into the Match Content box without the delimiting forward slashes will either be ignored or reported as a content match error by SiteScope.
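
As a rough illustration of the delimiter rule, the following Python sketch strips the forward slashes and searches the content. Python's re module accepts most Perl-style syntax, but it is not the engine SiteScope uses. The function name match_content, the sample page text, and the choice to raise an error for an undelimited entry are assumptions made for this example.

```python
import re

def match_content(entry, content):
    """Return True if the slash-delimited entry matches the content.

    Hypothetical helper: it only mimics the rule that a Match Content
    entry must start and end with a forward slash.
    """
    if len(entry) < 2 or not (entry.startswith("/") and entry.endswith("/")):
        # Assumption: treat an undelimited entry as an error.
        raise ValueError("entry must be delimited by forward slashes")
    pattern = entry[1:-1]  # drop the leading and trailing slash
    return re.search(pattern, content) is not None

page = "Welcome to our website. Buy Now!"  # made-up sample content
print(match_content("/website/", page))  # True
print(match_content("/catalog/", page))  # False
```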

Adding parentheses ( ) within the forward slashes surrounding the regular expression is another very useful feature for regular expressions in SiteScope. The parentheses are used to create what is called a "back reference". As a back reference SiteScope retains what was matched between the parentheses and displays the text in the Status field of the monitor detail page. This is very useful for troubleshooting match content. This is also a way to pass a matched value from one monitor to another or from one step of a URL Sequence Monitor to the next step of the same transaction. Parentheses are also used to limit alternations which are discussed below.

Matching String Literals

Finding and matching an exact or literal string is the simplest form of pattern matching with regular expressions. In matching literals, regular expressions behave much as they do in a search/replace feature in word processing applications. The example above matched the text website. The regular expression /Buy Now/ will succeed if the text returned to the monitor contains the characters Buy Now, including the space, in that order.

It is important to note that regular expressions are, by default, case-sensitive and literal. That means that the content must match the expression in case and order, including non-alphanumeric characters. For example a regular expression of /website/, without any modifiers, will only succeed if the content contains the string website exactly but will fail even if the content is Website, WEBSITE, or Web site. (In the last case the match fails because of the space between the two words.)

There are cases where you may want to literally match certain non-alphanumeric characters which are "reserved" metacharacters used in regular expressions. Some of these metacharacters may conflict with important literals that you are trying to match with your regular expression. For example the period or dot symbol (.), the asterisk (*), the dollar sign ($), and back slash (\) have special meanings within regular expressions. Because one of these characters may be a key part of a particular text pattern you are looking for, you can "escape" these characters in your regular expression so that the regular expression processing treats them as literal characters rather than interpreting them as special metacharacters. To force any character to be interpreted as a literal rather than a metacharacter you add a back slash in front of that character. For example, if you wanted to find the string 4.99 on a Web page you might create a regular expression of /4.99/. While this will match the string 4.99 it would also match strings like 4599 and 4Q99 because of the special meaning of the period character. To have the regular expression interpret the period as a literal you need to escape the period with a back slash as follows: /4\.99/. You can add the back slash escape character in front of any punctuation character to force the regular expression processing to interpret the character following the back slash as a literal. Generally, you should use this syntax whenever you want to match any punctuation mark or other non-alphanumeric character.
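
The effect of escaping the period can be checked with a short Python sketch. Python's re module is used here only as a stand-in for a Perl-style engine, and the sample strings are the ones mentioned above.

```python
import re

samples = ["4.99", "4599", "4Q99"]

unescaped = re.compile(r"4.99")   # the dot matches any character
escaped = re.compile(r"4\.99")    # the dot is a literal period

unescaped_hits = [s for s in samples if unescaped.search(s)]
escaped_hits = [s for s in samples if escaped.search(s)]

print(unescaped_hits)  # ['4.99', '4599', '4Q99']
print(escaped_hits)    # ['4.99']

# Python can also generate the escaped form of a literal string.
print(re.escape("4.99"))  # 4\.99
```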

Using Alternation

Alternation allows you to construct either/or matches where you know that one of two or more strings should appear in the content. The alternation character is the vertical pipe symbol: | . The vertical pipe is used to separate the alternate strings in the expression. For example the regular expression /(e-mail|email|contact us)/ will succeed if the content contains any one of the three strings separated by the vertical pipes. The parentheses are used here to delimit alternations. In this example there are no patterns outside of the alternation that need to be matched. In contrast, a regular expression might be written as /(e-mail|email|contact) us/. In this case, the match only succeeds when any of the three alternates enclosed in the parentheses is found followed immediately by a single white space and the word us. This is more restrictive than the previous example but also shows how the parentheses limit the alternation to the three words contained inside them. The match fails if one or more of the alternates are found but the word us does not immediately follow.
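
The difference between the two forms can be sketched in Python, used here as a stand-in for a Perl-style engine. The patterns are simplified to two alternates and the sample sentences are invented for illustration.

```python
import re

loose = re.compile(r"(e-mail|contact us)")    # either string anywhere
strict = re.compile(r"(e-mail|contact) us")   # alternate must be followed by " us"

first = "Please e-mail the webmaster"
second = "Please contact us today"

print(bool(loose.search(first)))    # True
print(bool(strict.search(first)))   # False: "e-mail" is not followed by " us"
print(strict.search(second).group(1))  # contact
```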

Matching Patterns with Metacharacters

Often you will not know the exact text you need to match or the text pattern may vary from one session or from one day to another. Regular expressions have a number of special metacharacters used to define patterns and match whole categories of characters. While matching literal alphanumeric characters seems trivial, part of the power of regular expressions is the ability to match non-alphanumeric characters as well. Because of this it is important to keep in mind that your regular expressions need to account for the presence of non-alphanumeric characters in the content you are searching. This means that characters such as periods, commas, hyphens, quote marks and even white space need to be considered when constructing regular expressions.

Metacharacters Used in Regular Expressions

Metacharacter	Description
\s	Matches generic white space (for example, the character produced by the Spacebar key; in standard Perl syntax \s also covers tabs and line breaks). This metacharacter is particularly useful when combined with a quantifier to match varying numbers of white space positions that may occur between words that you are looking to match.
\S	Matches characters that are NOT white space. Note that the \S is capitalized versus the small \s used to match white space.
.	This is the period or dot character. Generally, it matches all characters which can be useful as well as frustrating. SiteScope considers the dot as a form of character class on its own and therefore it should not be included inside the square brackets of a character class.
\n	Matches the linefeed or newline character
\r	Matches a carriage return character
\w	Matches non-white space word characters, same result as what is matched by character class [A-Za-z0-9_]. It is important to note that the \w metacharacter matches the underscore character but not other punctuation marks such as hyphens, commas, periods, and so forth.
\W	Matches characters other than those matched by \w (lower case). This is particularly useful for matching punctuation marks and non-alphabetic characters such as ~!@#$%^&*()+={[}]:; and including the linefeed character, carriage return, and white space. It does not match the underscore character which is considered a word constituent matched by \w.
\d	Matches digits only. This is equivalent to the [0-9] character class
\D	Matches non-numeric characters (what \d does not match) plus other characters. Similar to \W but also matches on alphabetic characters. In SiteScope this will generally match everything, including multiple lines, until it encounters a digit.
\b	Requires that the match have a word boundary (usually a white space) at the position indicated by the \b
\B	Requires that the match NOT have a word boundary at the position indicated
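
A few of the metacharacters in the table can be tried out with the Python sketch below. Python's re module is an approximation of the Perl-style syntax described here, so behavior in SiteScope may differ in details. The sample text is made up.

```python
import re

text = "Total_items: 42 units\r\nIn stock"  # hypothetical content

digits = re.findall(r"\d", text)             # every single digit
first_word = re.search(r"\w+", text).group() # word characters include "_"
first_gap = re.search(r"\W+", text).group()  # first run of non-word characters
whole_word = bool(re.search(r"\bstock\b", text))  # "stock" as a whole word
partial_word = bool(re.search(r"\bunit\b", text)) # "unit" is only part of "units"

print(digits)        # ['4', '2']
print(first_word)    # Total_items
print(repr(first_gap))  # ': '
print(whole_word)    # True
print(partial_word)  # False
```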

Defining Character Classes

An important and very useful regular expression construct is known as a character class. Character classes provide a set of characters that may be found in a particular position within a regular expression. Character classes may be used to define a range of characters to match a single position or, with the addition of a quantifier, may be used to universally match multiple characters and even complete lines of text.

Character classes are formed by enclosing any combination of characters and metacharacters in square brackets: []. Character classes create an "any-or-all-of-these" group of characters that may be matched. Unlike literals and metacharacters outside character classes, the physical sequence of characters and metacharacters within a character class has no effect on the search or match sequence. For example, the class [ABC0123abc] matches the same content as [0123abcABC].

The hyphen is used to further streamline character classes to indicate a range of letters or numbers. For example, the class [0-9] includes all digits from zero to nine inclusive. The class [a-z] includes all lower case letters from a to z. You can also create more restrictive classes with the hyphen such as [e-tE-T] to match upper or lower case letters from E to T or [0-5] to match digits from zero to five only.

The caret character (^) can be used as the first character within a character class as a negation, to exclude certain characters from a content match. For example, the class [^0-9] matches any character that is not a digit.

Example Character Classes

Example	Description
[a-zA-Z]	This matches any alphabetic character, both upper case and lower, from the letter a to the letter z. To match more than one character, append a quantifier after the character class as described below.
[0-9]	This matches any digit between 0 and 9. To match more than one digit, append a quantifier after the character class as described below.
[\w\s]	This matches any alphanumeric character and/or any white space.
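[^\W_]	This matches any alphanumeric character but not the underscore. The leading caret negates the class, so it excludes everything matched by \W as well as the underscore. Note that a caret placed anywhere other than the first position, as in [\w^_], is treated as a literal caret and does not negate anything.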
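
The character classes above can be exercised with a short Python sketch. Python's re module is used as a stand-in for the Perl-style syntax. The sample string is invented, and the negated class [^\W_] is one common way to express "alphanumeric but not underscore".

```python
import re

sample = "Item_7b"  # hypothetical content

letters = re.findall(r"[a-zA-Z]", sample)
numbers = re.findall(r"[0-9]", sample)
alnum_only = re.findall(r"[^\W_]", sample)  # word characters minus "_"

# The order of characters inside a class does not matter.
same = re.findall(r"[ABC0123abc]", "a1B2c3") == re.findall(r"[0123abcABC]", "a1B2c3")

print(letters)     # ['I', 't', 'e', 'm', 'b']
print(numbers)     # ['7']
print(alnum_only)  # ['I', 't', 'e', 'm', '7', 'b']
print(same)        # True
```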

Using Quantifiers

Another set of metacharacters used in regular expressions provides character counting options. This adds a great deal of power and flexibility in content matching. Quantifiers are appended after the metacharacters and character classes described above to specify how many times the preceding character, metacharacter, or character class should be matched. For example in the regular expression /(contact|about)\s+us/, the metacharacter \s matches on white space. The plus sign quantifier following the \s means that there must be at least one but may be more than one white space between the words contact (or about) and us.

The following table describes the several quantifiers available for use in regular expressions. Quantifiers apply to the single character immediately preceding them. When used with character classes, the quantifier is placed outside the closing square bracket of the character class. For example: [a-z]+ or [0-9]*.

Quantifier	Description
?	The question mark means the preceding character or character class may appear once but is not required to appear in the position indicated. This is to say that the character is optional in the position indicated.
*	The asterisk requires that any number of the preceding character or character class appear in the designated position. This includes zero or more matches. Note: Care must be used in combining this quantifier with the dot (.) metacharacter or a character class including the \W metacharacter as these will likely "grab" more content than anticipated and cause the regular expression engine to use up all of the available CPU time on the SiteScope server.
+	The plus sign requires that the preceding character or character class appear at least once and possibly more than once.
{min,max}	Using curly braces creates a quantifier range. The two range enumerator digits are separated by a comma. This construct requires that the preceding character or character class appear at least as many times as specified by the min enumerator and no more than the value of the max enumerator. The match succeeds as long as there are at least as many matches as specified by the `min` enumerator. However, the matching continues up to the number of times specified by the `max` enumerator or until no more matches are found.
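
The quantifiers in the table behave as follows in a Python sketch (Python's re module standing in for the Perl-style engine). The pattern /(contact|about)\s+us/ is taken from the text above; the colou?r pattern and the sample strings are added purely for illustration.

```python
import re

plus = re.compile(r"(contact|about)\s+us")
print(bool(plus.search("contact   us")))  # True: three spaces satisfy \s+
print(bool(plus.search("contactus")))     # False: at least one space is required

optional = re.compile(r"colou?r")         # the "u" may appear once or not at all
print(bool(optional.search("color")))     # True
print(bool(optional.search("colour")))    # True

ranged = re.compile(r"\d{2,4}")           # between two and four digits
print(ranged.search("id 1234567").group())  # 1234
print(ranged.search("id 7"))                # None
```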

It is important to know that in SiteScope the match content will be run against the entire HTTP response, including the HTTP header which is not normally viewable via the browser. The HTTP header usually contains several lines of text including words coupled with sequences of numbers. This may trip up some otherwise simple content matching on short sets of numbers and letters. To avoid this try to identify a unique sequence of characters near the text you are trying to match and include them as literals, where applicable, in the regular expression.
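
The header problem can be sketched in Python. The response text below is a made-up example of an HTTP response with its header, and Python's re module is only a stand-in for the engine SiteScope uses.

```python
import re

# Hypothetical raw response: status line, one header, blank line, body.
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Length: 512\r\n"
    "\r\n"
    "<p>Items in stock: 37</p>"
)

# A bare digit pattern finds the first digits in the header, not the body.
loose = re.search(r"(\d+)", response).group(1)

# Including nearby literal text pins the match to the intended location.
anchored = re.search(r"Items in stock: (\d+)", response).group(1)

print(loose)     # 1
print(anchored)  # 37
```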

Search Mode Modifiers

Regular expressions used in SiteScope may include optional modifiers outside of the slashes used to delimit the expression. Modifiers after the ending slash affect the way the matching will be performed. For example, a regular expression of /website/i with the i search modifier added will make the match content search insensitive to upper and lower case letters. This would match website, Website, WEBSite, or even WEBSITE.

With the exception of the i modifier, some metacharacters and character classes can override search mode modifiers which can lead to frustration. In particular, the dot (.) and the \W metacharacters can override the m and s modifiers, matching content across multiple lines in spite of the modifier.

Regular Expression Match Mode Modifiers

Mode Modifier	Description
/i	Ignore case mode. This makes the search insensitive to upper case and lower case letters. This is a useful option especially when searching for matches in the text content of Web pages.
/c	The matched pattern must NOT appear anywhere in content that is being searched. This is a "complement" match meaning that an error is reported if the pattern IS found and succeeds if the pattern is NOT found.
/m	Match across multiple lines WITHOUT ignoring intervening carriage returns and linefeeds. With this modifier you may still need to account for possible linefeeds and carriage returns with a character class such as [\w\W]* or [\s\S\n\r]. The `.` will NOT match carriage returns or new line characters with this modifier.
/s	Consider the content as being on a single line, ignoring intervening carriage returns and new line characters. With this modifier both the [\w\W]* character class and the .* pattern will match across new lines and carriage returns.

More than one modifier can be added by concatenating them together after the closing slash of the regular expression. For example: /matchpattern/ic combines both the i and c modifiers.
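
One way to picture how the modifiers combine is the Python sketch below. The helper name evaluate is hypothetical. Mapping i, m, and s onto Python's re.IGNORECASE, re.MULTILINE, and re.DOTALL flags is an assumption based on the usual Perl meanings, and the c modifier is modeled simply as inverting the result. SiteScope's own handling may differ.

```python
import re

# Assumed mapping from modifier letters to Python flags.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}

def evaluate(entry, content):
    """Evaluate a /pattern/modifiers entry against content (illustrative only)."""
    if not entry.startswith("/") or "/" not in entry[1:]:
        raise ValueError("entry must be delimited by forward slashes")
    # Split at the last slash: pattern on the left, modifiers on the right.
    body, _, modifiers = entry[1:].rpartition("/")
    flags = 0
    complement = False
    for letter in modifiers:
        if letter == "c":
            complement = True          # succeed only if the pattern is absent
        elif letter in FLAG_MAP:
            flags |= FLAG_MAP[letter]
        else:
            raise ValueError(f"unknown modifier: {letter}")
    found = re.search(body, content, flags) is not None
    return not found if complement else found

print(evaluate("/website/i", "Visit our WEBSite"))  # True
print(evaluate("/website/", "Visit our WEBSite"))   # False
print(evaluate("/error/ic", "All OK"))              # True
print(evaluate("/error/ic", "Fatal ERROR"))         # False
```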

Retaining Content Match Values

Some monitors, like the URL Monitor and URL Sequence Monitor, have a content match value that is logged and can be used to set error status thresholds. Another purpose of the parentheses /(match pattern)/ used in regular expression syntax is to determine what text is retained for the Content Match Value. You can use this feature to apply content match values directly as the error thresholds for a URL Monitor or URL Sequence Monitor.

For example, if the content match expression was

/Copyright (\d*)/

and the content returned to the monitor by the URL request included the string:

Copyright 2003

then the match is made and the retained content match value would be:

2003

Under the Error if option at the bottom of the monitor setup page, you could then change the error-if condition from the default of status != 200 to content match, specify the relational operator as !=, and specify the value 1998. This sets the error threshold for this monitor so that whenever the year in the string Copyright dddd is something other than 1998, the monitor reports an error. This mechanism could be used to watch for unauthorized content changes on Web pages.
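
The retain-and-compare logic can be sketched in Python, again only as a stand-in for what SiteScope does internally. The content string and the variable names are assumptions for this example.

```python
import re

content = "Copyright 2003 Example Corp."  # hypothetical page text

match = re.search(r"Copyright (\d*)", content)
retained = match.group(1) if match else None  # text captured by the parentheses

# Error-if condition from the example: content match != 1998
is_error = retained != "1998"

print(retained)  # 2003
print(is_error)  # True
```

Because \d* also accepts zero digits, the expression would still match the word Copyright followed by a space and no year, retaining an empty value. A quantifier such as \d+ or \d{4} would be stricter.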

Checking a Web page for links to other URLs can be an important part of constructing URL Sequence Monitors. The following regular expression can be used to match the URL text of a link on a Web page:

/a href="?([:\/\w\s\d\.]*)"?/i

This expression matches the href="protocol://path/URLname.htm" for many URLs. The question mark quantifiers make the quote marks around the value of the HREF attribute optional. The i modifier makes the match pattern case-insensitive.
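
The link expression can be tried in Python as a rough check. The HTML fragment is invented, and re.IGNORECASE is used as the counterpart of the i modifier.

```python
import re

html = '<p>See <A HREF="http://www.example.com/docs/index.htm">docs</A></p>'

link = re.search(r'a href="?([:\/\w\s\d\.]*)"?', html, re.IGNORECASE)
url = link.group(1) if link else None

print(url)  # http://www.example.com/docs/index.htm
```

Characters that are not in the class, such as hyphens or question marks, end the retained value early, which is why the expression works for many URLs but not all.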

Retained or "remembered" values from content matches can be referenced and used as input for subsequent steps in a URL Sequence Monitor. See the Match Content section of the URL Sequence Monitor for the syntax used for Retaining and Passing Values Between Sequence Steps.

SiteScope Date Variables

To make it easy to create expressions that match the current date or time, SiteScope uses several specially defined variables. These special variables are set by SiteScope to different parts of the current system date and time. They can be used in content match fields to find date coded content. The General Date Variables are useful for matching portions of various date formats. The Language/Country Specific Date Variables allow you to automatically extend the language used for month names and weekday names to specific countries based on ISO codes.

General Date Variables

The following table lists the general variables:

Variable	Range of Values
$hour$	0 - 23
$minute$	0 - 59
$month$	1 - 12
$day$	1 - 31
$year$	1000 - 9999
$shortYear$	00 - 99
$weekdayName$	Sun - Sat
$fullWeekdayName$	Sunday - Saturday
$0hour$	00 - 23
$0minute$	00 - 59
$0day$	01 - 31 (two digit day format)
$0month$	01 - 12 (two digit month format)
$monthName$	Jan - Dec (three letter month format in English)
$fullMonthName$	January - December
$ticks$	milliseconds since midnight, January 1, 1970

For example, if the content match search expression was defined as:

/Updated on $0month$\/$0day$\/$shortYear$/

and the content returned by the request includes the string:

Updated on 06/01/98

then the expression would match when the monitor is run on June 1st, 1998. The match fails if the content returned does not contain a string matching the current system date or the date format is different than the format specified.

If you want the time to be before or after the current time, you can add a $offsetMinutes=mmmm$ to the expression, and this will offset the current time by mmmm minutes (negative numbers are allowed for going backwards in time) before doing the substitutions.

For example, if the current day is June 1st, 1998, and the search expression is:

/$offsetMinutes=1440$Updated on $0month$\/$0day$\/$shortYear$/

the content string that would match would be:

Updated on 06/02/98

Note that the date is one day ahead of the system date.
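
The substitution step can be pictured with the Python sketch below. The function name expand_date_variables is hypothetical and only a subset of the variables is handled. It assumes the variables are replaced with plain text before the expression is used for matching, which is how the description above reads.

```python
import re
from datetime import datetime, timedelta

def expand_date_variables(expression, now):
    """Replace a subset of the date variables with values (illustrative only)."""
    offset = re.search(r"\$offsetMinutes=(-?\d+)\$", expression)
    if offset:
        # Shift the clock first, then remove the offset marker itself.
        now = now + timedelta(minutes=int(offset.group(1)))
        expression = expression.replace(offset.group(0), "")
    values = {
        "$year$": f"{now.year:04d}",
        "$shortYear$": f"{now.year % 100:02d}",
        "$month$": str(now.month),
        "$0month$": f"{now.month:02d}",
        "$day$": str(now.day),
        "$0day$": f"{now.day:02d}",
        "$hour$": str(now.hour),
        "$0hour$": f"{now.hour:02d}",
        "$minute$": str(now.minute),
        "$0minute$": f"{now.minute:02d}",
    }
    for name, value in values.items():
        expression = expression.replace(name, value)
    return expression

now = datetime(1998, 6, 1, 12, 0)  # June 1st, 1998, as in the example
today = expand_date_variables(r"/Updated on $0month$\/$0day$\/$shortYear$/", now)
tomorrow = expand_date_variables(
    r"/$offsetMinutes=1440$Updated on $0month$\/$0day$\/$shortYear$/", now
)

print(today)     # /Updated on 06\/01\/98/
print(tomorrow)  # /Updated on 06\/02\/98/
print(bool(re.search(tomorrow[1:-1], "Updated on 06/02/98")))  # True
```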

Language/Country Specific Date Variables

The following table lists the SiteScope special variables for use with international day and month name matching. The characters LL and CC are placeholders for two letter ISO 639 language code characters and two letter ISO 3166 country code characters (see the notes below the table for more details).

Variable	Range of Values
$weekdayName_LL_CC$	Abbreviated weekday names for the language (LL) and country (CC) specified (see notes below).
$fullWeekdayName_LL_CC$	Full weekday names for the language (LL) and country (CC) specified.
$monthName_LL_CC$	Abbreviated month names for the language (LL) and country (CC) specified.
$fullMonthName_LL_CC$	Full month names for the language (LL) and country (CC) specified.

CC - an uppercase 2-character ISO-3166 country code. Examples are: DE for Germany, FR for France, CN for China, JP for Japan, BR for Brazil. You can find a full list of these codes at a number of Internet sites, such as: http://www.din.de/gremien/nas/nabd/iso3166ma/codlstp1/en_listp1.html
LL - a lowercase 2-character ISO-639 language code. Examples are: de for German, fr for French, zh for Chinese, ja for Japanese, pt for Portuguese. You can find a full list of these codes at a number of Internet sites, such as:
http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt or
http://www.dsv.su.se/~jpalme/ietf/language-codes.html

For example, if the content match expression was defined as:

/$fullWeekdayName_fr_FR$/i

and the content returned by the request includes the string:

mercredi

then this expression would match when the monitor was run on Wednesday.

If you are not concerned with the country-specific language variations, it is possible to use any of the above variables without including the country code. For example:

/$fullWeekdayName_fr$/

could be used to match the same content as /$fullWeekdayName_fr_FR$/.

Special Substitution for Monitor URL or File Path

SiteScope Date Variables are useful for matching content as part of a regular expression. The date variables can also be used as a special substitution to dynamically create URLs or file path names for specific monitors. This is useful for monitoring date coded files and directories where the URL or file path name is updated automatically based on system date information. One case would be an application (SiteScope is one example) that creates date coded log files. The log file names include some form of the year, month, and day as part of the file name. An example would be a file name File2001_05_01.log where the year, month, and day are included.

Based on this example a new file would be created each day. Monitoring the creation, size, or content of the current day's file would normally require that the file path name or URL of the monitor be manually changed each day. Using the SiteScope date variables and special substitution you can have SiteScope automatically update the file path to the current day's log file. By knowing the pattern used in naming the files you can construct a special substitution string, similar to a regular expression, that will substitute portions of the system date properties into the file path or URL.

For example, suppose the absolute file path to the current day's log file in a file monitor is:

D:/Production/Webapps/Logs/File2001_05_01.log

and the log file for the following day would be:

D:/Production/Webapps/Logs/File2001_05_02.log

You can construct a special substitution expression to automatically update the file path used by the monitor with the following syntax:

s/D:\/Production\/Webapps\/Logs\/File$year$_$0month$_$0day$.log/

The substitution requires that the expression start with a lower case s and that the expression is enclosed by forward slashes /.../. Forward slashes that are part of the file path must be escaped by adding the back slash (\) character as shown. The SiteScope date variables are separated by the underscore character literals. SiteScope checks the system time properties each time the monitor runs and substitutes the applicable values into the file path or URL before accessing the file.
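
The path substitution can be sketched in Python as follows. The helper name expand_path is hypothetical, only three variables are handled, and the two-digit day uses the $0day$ variable from the General Date Variables table.

```python
from datetime import date

def expand_path(expression, today):
    """Turn an s/.../ substitution expression into a concrete path (illustrative only)."""
    if not (expression.startswith("s/") and expression.endswith("/")):
        raise ValueError("expression must have the form s/.../")
    # Drop the leading "s/" and trailing "/", then unescape the path slashes.
    body = expression[2:-1].replace("\\/", "/")
    values = {
        "$year$": f"{today.year:04d}",
        "$0month$": f"{today.month:02d}",
        "$0day$": f"{today.day:02d}",
    }
    for name, value in values.items():
        body = body.replace(name, value)
    return body

expr = r"s/D:\/Production\/Webapps\/Logs\/File$year$_$0month$_$0day$.log/"
print(expand_path(expr, date(2001, 5, 1)))  # D:/Production/Webapps/Logs/File2001_05_01.log
print(expand_path(expr, date(2001, 5, 2)))  # D:/Production/Webapps/Logs/File2001_05_02.log
```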

SiteScope monitor types that support the special substitution are:

eBusiness chain
File Monitor
Log Monitor
URL Monitor
URL Sequence Monitor
Web server monitor

While the special substitution syntax is similar to the substitution syntax used in regular expressions, they are not the same. While all of the SiteScope date variables can be used in match content regular expressions, the special substitution discussed here cannot be used as part of a match content expression.

Some Pitfalls in Working with Regular Expressions

The most significant problem that has been seen with regular expressions in SiteScope is the use of the .* construct. This match construct presents a very large number of possible matches on any page of content. The use of the .* construct is known to cause the regular expression matching engine used by SiteScope to take over all available CPU cycles on the SiteScope server. If this occurs, SiteScope will not be able to function and will have to be restarted each time the monitor with the offending regular expression is run until the expression has been corrected.

One key concept to remember is that regular expression matching is run against the entire text content returned to the SiteScope monitor. As mentioned above, this includes HTTP headers which are normally not viewable in the browser window (for example, the View->Source option). This also means that you need to account for other information that may not be displayed in the browser view. This includes text in META tags used by Internet search engines as well as client side scripts.

In the case of URLs which contain client side scripts, like JavaScript, the text matching is done against the code lines of the script and not against the browser's output from the script. This means that if the script dynamically writes or replaces text on the Web page with values calculated by the script, it may not be possible to match this content with regular expressions. If the script is only changing text you may be able to match the corresponding text strings that appear in the script code. A further pitfall arises when you are trying to check that a certain condition has been met in the browser but the matching text string appears in the script content regardless of any user action.

It is also important to remember that a regular expression match succeeds as soon as the minimum match requested has been satisfied. After a match is made no further matching is performed. Therefore regular expressions are not well suited to counting the number of occurrences of a repeating text pattern. For example, if you want to check a Web page with a catalog list of items where each item has a link next to it saying Buy Now!, and you want to make sure at least five items are listed, a regular expression of /Buy Now!/ would succeed in matching the first Buy Now! only. Likewise, if your regular expression is looking for the word catalog on the main browser screen, the match may succeed if the word appears in a META tag in the HTML header section or if it appears as a hyperlink in a site navigation menu that appears in the content before the occurrence you were intending to match.
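
The first-match behavior can be seen in a small Python sketch. The HTML is a made-up page in which the word catalog appears in a META tag before it appears in the visible body.

```python
import re

html = (
    '<head><meta name="keywords" content="catalog, shop"></head>'
    "<body><h1>Our catalog</h1></body>"
)

match = re.search(r"catalog", html)

# The match is satisfied by the META tag, before the body is ever reached.
matched_in_head = match.start() < html.index("<body>")
print(matched_in_head)  # True
```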

Another pitfall is forgetting to account for non-alphanumeric content. As mentioned above, regular expressions need to be written to account for all of the characters that are and may be present. This includes white space, newline, and carriage returns. This is not normally a problem when matching a single word literal. It can be a challenge when you need to create a match for several words that are separated by unknown amounts of white space and other non-alphanumeric characters and that possibly span more than one line. The [\s\n\r]+ character class can be useful between words used in the expression. You should always check the format of the content you are trying to match to look for patterns and special characters, such as periods, commas, and hyphens, that may cause a seemingly simple match to fail.
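
A short Python sketch shows why a plain literal can fail where the [\s\n\r]+ class succeeds. The content string is invented and Python's re module stands in for the Perl-style engine.

```python
import re

# Hypothetical content with extra spaces and a line break between words.
content = "Thank you,\r\n   your order   has been\nreceived."

naive = re.search(r"order has been received", content)
flexible = re.search(r"order[\s\n\r]+has[\s\n\r]+been[\s\n\r]+received", content)

print(naive is None)         # True: single spaces do not match the real spacing
print(flexible is not None)  # True
```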

The use of "greedy" metacharacters can lead to frustration. In some cases, overly generous quantifiers combined with the . or \W metacharacters can grab content that you were intending to match with a literal string elsewhere in your regular expression resulting in a match failure. For example, the following might be used to match the URL content of the hyperlink anchor reference: /a href="([\W\w\s]*)"/. When the monitor performs the check for this regular expression, however, the match grabs the first occurrence of the pattern /a href="... and continues matching multiple lines of text up to the last quote mark found on the page. Without some other unique ending delimiter, the [\W\w\s]* class and quantifier combination is too "greedy". A more successful syntax that narrows the class of expected characters would be: /a href="?([:\/\w\s\d\.]*)"?/
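
The greedy behavior described above can be reproduced in Python as a rough check. The two-link HTML fragment is invented for the example.

```python
import re

html = (
    '<a href="http://www.example.com/a.htm">A</a>\n'
    '<a href="http://www.example.com/b.htm" class="nav">B</a>'
)

# Greedy: runs to the last quote mark in the content, across the line break.
greedy = re.search(r'a href="([\W\w\s]*)"', html).group(1)

# Narrower class: stops at the first character that cannot be part of the URL.
narrow = re.search(r'a href="?([:\/\w\s\d\.]*)"?', html).group(1)

print(greedy.endswith('class="nav'))  # True
print("\n" in greedy)                 # True
print(narrow)                         # http://www.example.com/a.htm
```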

The following are some examples of syntax for use in regular expressions:

Example Expression	Description
/CUSTID\s?=\s?([A-Z0-9]{20,48})/	This example matches an ID string that is made of 20 or more digits and upper case letters with no spaces or other non-alphanumeric characters. The `\s?` construct allows there to be a white space on either side of the equal sign. Using the parentheses around the character class will instruct SiteScope to retain this value (up to the maximum of 48 characters) as a content match value and the matched value will be displayed in the monitor detail status column.
/a href="?([:\/\w\s\d\.]*)"?/i	This example matches the URL string in an HTML hyperlink. The "? construct makes a quote mark on either end of the URL string optional. Using the parentheses instructs SiteScope to retain this value as a content match value and the value will be displayed in the monitor status. The `i` modifier tells the search to treat upper and lower case letters as equal.
/"[^"]*"/	This example matches text sequences that are contained between quote marks. Note the use of the negation caret (^) to define a character class of all characters other than the quote mark.
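
Two of the example expressions can be checked with the Python sketch below. The sample strings are invented, and Python's re module is only a stand-in for the engine SiteScope uses.

```python
import re

custid = re.compile(r"CUSTID\s?=\s?([A-Z0-9]{20,48})")

long_id = custid.search("CUSTID = ABCDEFGHIJ0123456789XYZ;")
short_id = custid.search("CUSTID=ABC123")  # fewer than 20 characters

print(long_id.group(1))  # ABCDEFGHIJ0123456789XYZ
print(short_id)          # None

quoted = re.findall(r'"[^"]*"', 'say "hello" and "bye"')
print(quoted)  # ['"hello"', '"bye"']
```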

As with programming and scripting languages, there is almost always more than one way to construct a regular expression to accomplish a particular match. There is not one right way to build regular expressions. You should plan to test and modify regular expressions as necessary until you get the results you need.

For an in-depth discussion of Perl regular expressions, you can consult a book about Perl programming, or find a Perl tutorial on the Web. More information on Perl regular expressions is available online through www.perl.com.