Using Regular Expressions

Several SiteScope monitors allow for content matching on the text returned from the monitor's request or action (see the documentation for each monitor for applicability). This adds an important level of functionality to SiteScope. SiteScope makes use of regular expressions to match text content.

Regular expressions are a text parsing tool that was developed for use with scripting languages such as Awk and Perl as well as several programming environments such as Emacs, Visual C++, and Java. Regular expressions themselves are not a programming language. They do, however, make use of many special combinations of characters and symbols, which often makes them more difficult to interpret than some programming languages. The many different combinations of these special characters, known as metacharacters, make regular expressions a very powerful and flexible tool for parsing and isolating specific text within a larger body of text.

Including a regular expression in the Match Content text box of a monitor instructs SiteScope to parse the text returned to the monitor when it runs and look for content that satisfies the pattern defined by the regular expression. This document presents an overview of the syntax and metacharacters used in regular expressions for matching content in SiteScope monitors.

Defining a Regular Expression

The first step in building a match content expression in SiteScope is the use of the forward slash: /. Entries in the Match Content text box of a SiteScope monitor must start and end with a forward slash to be recognized as regular expressions. For example, entering the expression /website/ into the Match Content box of a monitor instructs SiteScope to search through the text content received by the monitor for the text string: website. If a match is not found, the monitor reports an error status. When a match is found, the monitor reports a good status as long as all other monitor conditions are also met. Text or other characters entered into the Match Content box without delimiting forward slashes will either be ignored or reported as a content match error by SiteScope.

Adding parentheses ( ) within the forward slashes surrounding the regular expression is another very useful feature for regular expressions in SiteScope. The parentheses create what is called a "back reference". As a back reference, SiteScope retains what was matched between the parentheses and displays the text in the Status field of the monitor detail page. This is very useful for troubleshooting match content. It is also a way to pass a matched value from one monitor to another or from one step of a URL Sequence Monitor to the next step of the same transaction. Parentheses are also used to limit alternations, which are discussed below.

Matching String Literals

Finding and matching an exact or literal string is the simplest form of pattern matching with regular expressions. In matching literals, regular expressions behave much as they do in a search/replace feature in word processing applications. The example above matched the text website. The regular expression /Buy Now/ will succeed if the text returned to the monitor contains the characters Buy Now, including the space, in that order.
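
The literal matching and parentheses behavior described above can be tried outside SiteScope with a short sketch. It uses Python's re module as a stand-in for the SiteScope matching engine, which is an assumption: the two engines are not guaranteed to behave identically. The sample content string is hypothetical. The forward slashes are SiteScope delimiters and are not part of the pattern passed to Python.

```python
import re

# Hypothetical text returned to a monitor (not produced by SiteScope).
content = "Welcome to our website. Buy Now and save!"

# /website/ : a literal match succeeds when the string appears in the content.
literal = re.search(r"website", content)

# /(Buy Now)/ : the parentheses retain the matched text, loosely analogous
# to the value SiteScope displays in the monitor's Status field.
retained = re.search(r"(Buy Now)", content)

# A literal with two spaces fails because the content has only one space.
missing = re.search(r"Buy  Now", content)

print(literal is not None)
print(retained.group(1))
print(missing is None)
```
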

It is important to note that regular expressions are, by default, case-sensitive and literal. That means that the content must match the expression in case and order, including non-alphanumeric characters. For example, a regular expression of /website/, without any modifiers, will only succeed if the content contains the string website exactly. It will fail if the content instead contains Website, WEBSITE, or Web site. (In the last case the match fails because of the space between the two words.)

There are cases where you may want to literally match certain non-alphanumeric characters which are "reserved" metacharacters used in regular expressions. Some of these metacharacters may conflict with important literals that you are trying to match with your regular expression. For example, the period or dot symbol (.), the asterisk (*), the dollar sign ($), and the back slash (\) have special meanings within regular expressions. Because one of these characters may be a key part of a particular text pattern you are looking for, you can "escape" these characters in your regular expression so that the regular expression processing treats them as literal characters rather than interpreting them as special metacharacters. To force any character to be interpreted as a literal rather than a metacharacter, add a back slash in front of that character.

For example, if you wanted to find the string 4.99 on a Web page you might create a regular expression of /4.99/. While this will match the string 4.99, it would also match strings like 4599 and 4Q99 because of the special meaning of the period character. To have the regular expression interpret the period as a literal you need to escape the period with a back slash as follows: /4\.99/. You can add the back slash escape character in front of any character to force the regular expression processing to interpret the character following the back slash as a literal.
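
The effect of escaping the period can be checked with a minimal sketch. Python's re module is assumed here as a stand-in for the SiteScope engine, and the list of candidate strings is made up for illustration.

```python
import re

# Hypothetical strings that might appear on a Web page.
prices = ["4.99", "4599", "4Q99"]

unescaped = re.compile(r"4.99")   # the dot matches any single character
escaped = re.compile(r"4\.99")    # the back slash makes the dot a literal period

unescaped_hits = [p for p in prices if unescaped.search(p)]
escaped_hits = [p for p in prices if escaped.search(p)]

print(unescaped_hits)  # ['4.99', '4599', '4Q99']
print(escaped_hits)    # ['4.99']
```
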
Generally, you should use the back slash escape whenever you want to literally match a punctuation mark or other non-alphanumeric character.

Using Alternation

Alternation allows you to construct either/or matches where you know that one of two or more strings should appear in the content. The alternation character is the vertical pipe symbol: | . The vertical pipe is used to separate the alternate strings in the expression. For example, the regular expression /(e-mail|e-mail|contact us)/ will succeed if the content contains any one of the three strings separated by the vertical pipes. The parentheses are used here to delimit alternations. In this example there are no patterns outside of the alternation that need to be matched.

In contrast, a regular expression might be written as /(e-mail|e-mail|contact) us/. In this case, the match only succeeds when any of the three alternates enclosed in the parentheses is found followed immediately by a single white space and the word us. This is more restrictive than the previous example but also shows how the parentheses limit the alternation to the three words contained inside them. The match fails if one of the alternates is found but the word us does not immediately follow it.

Matching Patterns with Metacharacters

Often you will not know the exact text you need to match, or the text pattern may vary from one session or from one day to another. Regular expressions have a number of special metacharacters used to define patterns and match whole categories of characters. While matching literal alphanumeric characters seems trivial, part of the power of regular expressions is the ability to match non-alphanumeric characters as well. Because of this, it is important to keep in mind that your regular expressions need to account for the presence of non-alphanumeric characters in the content you are searching.
This means that characters such as periods, commas, hyphens, quote marks, and even white space need to be considered when constructing regular expressions.

Metacharacters Used in Regular Expressions

Defining Character Classes

An important and very useful regular expression construct is known as a character class. Character classes provide a set of characters that may be found in a particular position within a regular expression. Character classes may be used to define a range of characters to match a single position or, with the addition of a quantifier, may be used to universally match multiple characters and even complete lines of text.

Character classes are formed by enclosing any combination of characters and metacharacters in square brackets: []. Character classes create an "any-or-all-of-these" group of characters that may be matched. Unlike literals and metacharacters outside character classes, the physical sequence of characters and metacharacters within a character class has no effect on the search or match sequence. For example, the class [ABC0123abc] matches the same content as [0123abcABC].

The hyphen is used to further streamline character classes by indicating a range of letters or numbers. For example, the class [0-9] includes all digits from zero to nine inclusive. The class [a-z] includes all lower case letters from a to z. You can also create more restrictive classes with the hyphen, such as [e-tE-T] to match upper or lower case letters from E to T, or [0-5] to match digits from zero to five only.

The caret character (^) can be used within a character class as a negation, to exclude certain characters from a content match.

Example Character Classes
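
As a hedged illustration of the classes described above, the following sketch exercises them with Python's re module, which is assumed to behave like the SiteScope engine for these simple cases. The sample strings are hypothetical.

```python
import re

# Order inside a class does not matter: both classes accept the same characters.
class_a = re.compile(r"[ABC0123abc]")
class_b = re.compile(r"[0123abcABC]")
samples = ["A", "2", "c", "Z", "9"]
accepted_a = [s for s in samples if class_a.fullmatch(s)]
accepted_b = [s for s in samples if class_b.fullmatch(s)]

# Ranges written with the hyphen.
digits = re.findall(r"[0-9]", "Order 66 shipped")
e_to_t = re.findall(r"[e-tE-T]", "Data Set")

# A leading caret negates the class: match anything that is NOT a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

print(accepted_a == accepted_b)  # True
print(digits)                    # ['6', '6']
print(e_to_t)                    # ['t', 'S', 'e', 't']
print(non_digits)                # ['a', 'b']
```
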

Using Quantifiers

Another set of metacharacters used in regular expressions provides character counting options. This adds a great deal of power and flexibility in content matching. Quantifiers are appended after the metacharacters and character classes described above to specify how many positions the preceding character or metacharacter should be matched against. For example, in the regular expression /(contact|about)\s+us/, the metacharacter \s matches on white space. The plus sign quantifier following the \s means that there must be at least one but may be more than one white space between the words contact (or about) and us.

The following table describes the several quantifiers available for use in regular expressions. Quantifiers apply to the single character immediately preceding them. When used with character classes, the quantifier is placed outside the closing square bracket of the character class. For example: [a-z]+ or [0-9]*.
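
The quantifier examples above can be sketched as follows. Python's re module is assumed as a stand-in for the SiteScope engine, and the test phrases are invented for illustration.

```python
import re

# /(contact|about)\s+us/ : one or more white space characters before "us".
pattern = re.compile(r"(contact|about)\s+us")

phrases = ["contact us", "about \t us", "contactus", "contact them"]
results = [bool(pattern.search(p)) for p in phrases]

# The quantifier sits outside the closing bracket of a character class.
plus_on_empty = re.fullmatch(r"[a-z]+", "")   # + requires at least one character
star_on_empty = re.fullmatch(r"[0-9]*", "")   # * also accepts zero characters

print(results)                     # [True, True, False, False]
print(plus_on_empty is None)       # True
print(star_on_empty is not None)   # True
```
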

It is important to know that in SiteScope the match content will be run against the entire HTTP response, including the HTTP header, which is not normally viewable via the browser. The HTTP header usually contains several lines of text including words coupled with sequences of numbers. This may trip up some otherwise simple content matching on short sets of numbers and letters. To avoid this, try to identify a unique sequence of characters near the text you are trying to match and include them as literals, where applicable, in the regular expression.

Search Mode Modifiers

Regular expressions used in SiteScope may include optional modifiers outside of the slashes used to delimit the expression. Modifiers after the ending slash affect the way the matching will be performed. For example, a regular expression of /website/i, with the i search modifier added, makes the match content search insensitive to upper and lower case letters. This would match website, Website, WEBSite, or even WEBSITE.

With the exception of the i modifier, some metacharacters and character classes can override search mode modifiers, which can lead to frustration. In particular, the dot (.) and the \W metacharacters can override the m and s modifiers, matching content across multiple lines in spite of the modifier.

Regular Expression Match Mode Modifiers
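
The effect of the i modifier can be illustrated with a small sketch. It assumes that Python's re.IGNORECASE flag is a fair analogue of the i modifier. Python has no /pattern/i syntax, so the flag is passed as an argument instead.

```python
import re

variants = ["website", "Website", "WEBSite", "WEBSITE"]

# Rough equivalent of /website/ (no modifier): case-sensitive.
sensitive = [v for v in variants if re.search(r"website", v)]

# Rough equivalent of /website/i: re.IGNORECASE plays the role of the i modifier.
insensitive = [v for v in variants if re.search(r"website", v, re.IGNORECASE)]

print(sensitive)    # ['website']
print(insensitive)  # ['website', 'Website', 'WEBSite', 'WEBSITE']
```
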

More than one modifier can be added by concatenating them together after the closing slash of the regular expression. For example: /matchpattern/ic combines both the i and c modifiers.

Retaining Content Match Values

Some monitors, like the URL Monitor and URL Sequence Monitor, have a content match value that is logged and can be used to set error status thresholds. Another purpose of the parentheses /(match pattern)/ used in regular expression syntax is to determine what text is retained for the Content Match Value. You can use this feature to apply content match values directly as thresholds that determine the error status of a URL Monitor or URL Sequence Monitor. For example, if the content match expression was /Copyright (\d*)/ and the content returned to the monitor by the URL request included the string:

... Copyright 2003 by Mercury Interactive

then the match is made and the retained content match value would be:

2003

Under the Error if option at the bottom of the monitor set up page you could then change the error-if condition from the default of status != 200 to content match, specify the relational operator as !=, and specify the value 1998. This sets the error threshold for this monitor so that whenever the year in the string Copyright dddd is something other than 1998, the monitor reports an error. This mechanism could be used to watch for unauthorized content changes on Web pages.

Checking a Web page for links to other URLs can be an important part of constructing URL Sequence Monitors. The following regular expression can be used to match the URL text of a link on a Web page:

/a href="?([:\/\w\s\d\.]*)"?/i

This expression matches the href="protocol://path/URLname.htm" for many URLs. The question mark quantifiers make the quote marks around the HREF attribute value optional. The i modifier makes the match case-insensitive.
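
The two retained-value examples above can be sketched outside SiteScope. The sketch assumes Python's re module as a stand-in for the SiteScope engine. The link URL is a made-up placeholder, and the status variable only imitates the error-if comparison; it is not a SiteScope API.

```python
import re

# Hypothetical content returned by a URL request.
content = ('... Copyright 2003 by Mercury Interactive '
           '<A HREF="http://www.example.com/path/page.htm">Link</A>')

# /Copyright (\d*)/ : the parentheses decide what is retained.
year = re.search(r"Copyright (\d*)", content).group(1)

# Imitation of "error if content match != 1998".
status = "error" if year != "1998" else "good"

# /a href="?([:\/\w\s\d\.]*)"?/i : retain the URL text of the first link.
link = re.search(r'a href="?([:\/\w\s\d\.]*)"?', content, re.IGNORECASE)
url = link.group(1)

print(year)    # 2003
print(status)  # error
print(url)     # http://www.example.com/path/page.htm
```

Note that the character class does not include the hyphen, so with this expression a URL containing a hyphen would be retained only up to that character.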

Retained or "remembered" values from content matches can be referenced and used as input for subsequent steps in a URL Sequence Monitor. See the Match Content section of the URL Sequence Monitor for the syntax used for Retaining and Passing Values Between Sequence Steps.

SiteScope Date Variables

To make it easy to create expressions that match the current date or time, SiteScope uses several specially defined variables. These special variables are set by SiteScope to different parts of the current system date and time. They can be used in content match fields to find date coded content. The General Date Variables are useful for matching portions of various date formats. The Language/Country Specific Date Variables allow you to automatically extend the language used for month names and weekday names to specific countries based on ISO codes.

General Date Variables

The following table lists the general variables:

For example, if the content match search expression was defined as:

/Updated on $0month$\/$0day$\/$shortYear$/

and the content returned by the request includes the string:

Updated on 06/01/98

then the expression would match when the monitor is run on June 1st, 1998. The match fails if the content returned does not contain a string matching the current system date or if the date format is different than the format specified.

If you want the time to be before or after the current time, you can add a $offsetMinutes=mmmm$ to the expression. This offsets the current time by mmmm minutes (negative numbers are allowed for going backwards in time) before doing the substitutions. For example, if the current day is June 1st, 1998, and the search expression is:

/$offsetMinutes=1440$Updated on $0month$\/$0day$\/$shortYear$/

the content string that would match would be:

Updated on 06/02/98

Note that the date is one day ahead of the system date.

Language/Country Specific Date Variables

The following table lists the SiteScope special variables for use with international day and month name matching. The characters LL and CC are placeholders for two letter ISO 639 language code characters and two letter ISO 3166 country code characters (see the notes below the table for more details).
CC - an uppercase 2-character ISO-3166 country code. Examples are:
DE for Germany, FR for France, CN for China,
JP for Japan, BR for Brazil. You can find a full list of
these codes at a number of Internet sites, such as: http://www.din.de/gremien/nas/nabd/iso3166ma/codlstp1/en_listp1.html

For example, if the content match expression was defined as:

/$fullWeekdayName_fr_FR$/i

and the content returned by the request includes the string:

mercredi

then this expression would match when the monitor was run on Wednesday. If you are not concerned with the country-specific language variations, it is possible to use any of the above variables without including the country code. For example, /$fullWeekdayName_fr$/ could be used to match the same content as /$fullWeekdayName_fr_FR$/.

Special Substitution for Monitor URL or File Path

SiteScope Date Variables are useful for matching content as part of a regular expression. The date variables can also be used as a special substitution to dynamically create URLs or file path names for specific monitors. This is useful for monitoring date coded files and directories where the URL or file path name is updated automatically based on system date information. One such case is an application (SiteScope is one example) that creates date coded log files. The log file names include some form of the year, month, and day as part of the file name. An example would be a file name File2001_05_01.log where the year, month, and date are included. Based on this example, a new file would be created each day. Monitoring the creation, size, or content of the current day's file would normally require that the file path name or URL of the monitor be changed manually each day. Using the SiteScope date variables and special substitution you can have SiteScope automatically update the file path to the current day's log file. By knowing the pattern used in naming the files you can construct a special substitution string, similar to a regular expression, that will substitute portions of the system date properties into the file path or URL.
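
The date variable substitution that both content matching and the special substitution rely on can be sketched roughly as follows. The function expand_date_variables is hypothetical and is not part of SiteScope. The meanings assumed for $year$, $shortYear$, $0month$, and $0day$ (four-digit year, two-digit year, zero-padded month, zero-padded day) are inferred from the examples in this document, and only those four variables plus $offsetMinutes=mmmm$ are handled.

```python
import re
from datetime import datetime, timedelta


def expand_date_variables(expression: str, now: datetime) -> str:
    """Hypothetical stand-in for SiteScope's date variable substitution."""
    # Apply $offsetMinutes=N$ first, then drop it from the expression.
    offset = re.search(r"\$offsetMinutes=(-?\d+)\$", expression)
    if offset:
        now = now + timedelta(minutes=int(offset.group(1)))
        expression = expression.replace(offset.group(0), "")
    # Assumed meanings, inferred from the examples in this document.
    values = {
        "$year$": f"{now.year:04d}",
        "$shortYear$": f"{now.year % 100:02d}",
        "$0month$": f"{now.month:02d}",
        "$0day$": f"{now.day:02d}",
    }
    for name, value in values.items():
        expression = expression.replace(name, value)
    return expression


run_time = datetime(1998, 6, 1, 12, 0)  # monitor runs on June 1st, 1998
same_day = expand_date_variables(
    r"Updated on $0month$\/$0day$\/$shortYear$", run_time)
next_day = expand_date_variables(
    r"$offsetMinutes=1440$Updated on $0month$\/$0day$\/$shortYear$", run_time)

print(bool(re.search(same_day, "Updated on 06/01/98")))  # True
print(bool(re.search(next_day, "Updated on 06/02/98")))  # True
```
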

For example, if the absolute file path to the current day's log file in a file monitor is:

D:/Production/Webapps/Logs/File2001_05_01.log

the log file for the following day would be:

D:/Production/Webapps/Logs/File2001_05_02.log

You can construct a special substitution expression to automatically update the file path used by the monitor with the following syntax:

s/D:\/Production\/Webapps\/Logs\/File$year$_$0month$_$0date$.log/

The substitution requires that the expression start with a lower case s and that the expression is enclosed by forward slashes /.../. Forward slashes that are part of the file path must be escaped by adding the back slash (\) character as shown. The SiteScope date variables are separated by the underscore character literals. SiteScope checks the system time properties each time the monitor runs and substitutes the applicable values into the file path or URL before accessing the file.

SiteScope monitor types that support the special substitution are:

While the special substitution syntax is similar to the substitution syntax used in regular expressions, they are not the same. While all of the SiteScope date variables can be used in match content regular expressions, the special substitution discussed here cannot be used as part of a match content expression.

Some Pitfalls in Working with Regular Expressions

The most significant problem that has been seen with regular expressions in SiteScope is the use of the .* construct. This match construct presents a very large number of possible matches on any page of content. The use of the .* construct is known to cause the regular expression matching engine used by SiteScope to take over all available CPU cycles on the SiteScope server. If this occurs, SiteScope will not be able to function and will have to be restarted each time the monitor with the offending regular expression is run, until the expression has been corrected.

One key concept to remember is that regular expression matching is run against the entire text content returned to the SiteScope monitor. As mentioned above, this includes HTTP headers, which are normally not viewable in the browser window (for example, with the View->Source option). This also means that you need to account for other information that may not be displayed in the browser view. This includes text in META tags used by Internet search engines as well as client side scripts.

In the case of URLs which contain client side scripts, like JavaScript, the text matching is done against the code lines of the script and not against the browser's output from the script. This means that if the script dynamically writes or replaces text on the Web page with values calculated by the script, it may not be possible to match this content with regular expressions. If the script is only changing text, you may be able to match the corresponding text strings that appear in the script code.
A further pitfall is that you may be trying to check that a certain condition has been met in the browser while the matching text string appears in the script content regardless of any user action.

It is also important to remember that a regular expression match succeeds as soon as the minimum match requested has been satisfied. After a match is made, no further matching is performed. Therefore regular expressions are not well suited to counting the number of occurrences of a repeating text pattern. For example, if you want to check a Web page with a catalog list of items, where each item has a link next to it saying Buy Now!, and you wanted to make sure at least five items were listed, a regular expression of /Buy Now!/ would succeed in matching the first Buy Now! only. Likewise, if your regular expression is looking for the word catalog on the main browser screen, the match may succeed if the word appears in a META tag in the HTML header section or if it appears as a hyperlink in a site navigation menu that appears in the content before the occurrence you were intending to match.

Another pitfall is forgetting to account for non-alphanumeric content. As mentioned above, regular expressions need to be written to account for all of the characters that are and may be present. This includes white space, newlines, and carriage returns. This is not normally a problem when matching a single word literal. It can be a challenge when you need to match several words that are separated by unknown amounts of white space and other non-alphanumeric characters and that possibly span more than one line. The [\s\n\r]+ character class can be useful between words used in the expression. You should always check the format of the content you are trying to match to look for patterns and special characters, such as periods, commas, and hyphens, that may cause a seemingly simple match to fail.

The use of "greedy" metacharacters can lead to frustration.
In some cases, overly generous quantifiers combined with the . or \W metacharacters can grab content that you were intending to match with a literal string elsewhere in your regular expression, resulting in a match failure. For example, the following might be used to match the URL content of the hyperlink anchor reference:

/a href="([\W\w\s]*)"/

When the monitor performs the check for this regular expression, however, the match grabs the first occurrence of the pattern /a href="... and continues matching multiple lines of text up to the last quote mark found on the page. Without some other unique ending delimiter, the [\W\w\s]* class and quantifier combination is too "greedy". A more successful syntax that narrows the class of expected characters would be:

/a href="?([:\/\w\s\d\.]*)"?/

The following are some examples of syntax for use in regular expressions:
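
As one hedged illustration, the difference between the greedy and the narrowed link expressions discussed above can be reproduced with Python's re module, assumed here as a stand-in for the SiteScope engine. The page text and URLs are hypothetical.

```python
import re

# Hypothetical page content with two links and a later quoted phrase.
page = ('<a href="http://www.example.com/one.htm">One</a>\n'
        '<a href="http://www.example.com/two.htm">Two</a> "quoted text"')

# Greedy: [\W\w\s]* accepts every character, so the match runs across lines
# and stops only at the last quote mark in the content.
greedy = re.search(r'a href="([\W\w\s]*)"', page).group(1)

# Narrowed: the class excludes the quote mark, so the match stops at the
# end of the first URL.
narrow = re.search(r'a href="?([:\/\w\s\d\.]*)"?', page).group(1)

print(narrow)            # http://www.example.com/one.htm
print("\n" in greedy)    # True: the greedy match spans both lines
```
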

As with programming and scripting languages, there is almost always more than one way to construct a regular expression to accomplish a particular match. There is not one right way to build regular expressions. You should plan to test and modify regular expressions as necessary until you get the results you need.

For an in-depth discussion of Perl regular expressions, you can consult a book about Perl programming or find a Perl tutorial on the Web. More information on Perl regular expressions is available online through www.perl.com.