Working with regular expressions in Free Pascal.
Regular expressions are a handy way to specify patterns of text. With them you can parse and validate user input, or search a web page or a document for patterns such as links, email addresses, or phone numbers. This tutorial is not intended to give a complete description of regular expressions; its main aim is to show how to work with them in a Free Pascal program.
Everyone programming with Perl knows about them. If you are a C programmer, you probably know them, too. But, as a Free Pascal programmer, did you know that there is a regular expressions unit available? In fact, it is not only available, it is included with the default installation of Lazarus/Free Pascal, so there is no need to download or install any supplementary stuff.
The Free Pascal regular expressions unit is called RegExpr, and you have to include it in your uses clause if you want to work with regex, as regular expressions are often called. The unit RegExpr defines a class called TRegExpr; it is an object of this class, with its properties and methods, that we will use to manage regular expressions in Free Pascal.
Performing a simple pattern matching (i.e. searching a string for a given text pattern) in Free Pascal requires the following steps:
- Create a TRegExpr object, using the TRegExpr.Create method.
- Assign the regular expression to be used to the property TRegExpr.Expression.
- Use the method TRegExpr.Exec to check whether the string passed as argument contains a match for this regular expression. The method returns the Boolean value True if a match was found.
- When done, call the method TRegExpr.Free to free the regular expression resources used.
Example: Suppose that our TRegExpr object is called "Re" and that we want to check if the string "S" contains either "allu", "aly", or "ali". In this simple case, we could do three string comparisons, of course, but we can also use regular expressions as follows:
Re.Expression := 'allu|aly|ali';
if Re.Exec(S) then
  Writeln('S contains the name searched for...');
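Put together, the four steps might look like the following minimal sketch. The program name and the test string are made up for illustration, and the try..finally block is an addition that makes sure the object is freed even if an exception occurs.

```pascal
program regexbasic;
{$mode objfpc}{$H+}
uses
  RegExpr;
var
  Re: TRegExpr;
  S: string;
begin
  S := 'This tutorial was written by aly.';  // hypothetical test string
  Re := TRegExpr.Create;                     // step 1: create the object
  try
    Re.Expression := 'allu|aly|ali';         // step 2: assign the pattern
    if Re.Exec(S) then                       // step 3: search S for a match
      Writeln('S contains the name searched for...')
    else
      Writeln('No match.');
  finally
    Re.Free;                                 // step 4: release the object
  end;
end.
```

With the test string above, the program prints the "S contains the name searched for..." message, as the string contains "aly".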
Before viewing some practical examples, let's have a look at some of the features of regular expressions. For a full description, see the Regular expressions (RegEx) article.
Any single character (except special regex characters) matches itself. A series of (not special) characters matches that series of characters in the input string. Thus, the regex allu matches "allu" and nothing else. We can create a choice between several series of (not special) characters using the vertical bar character |, as seen above. The regex allu|aly|ali matches any of the 3 names: the Exec method returns "True" if its argument is "ali" and "False" if it is "alli", but it also returns "True" if it is "alien", for example. The important point to remember is that pattern matching is about strings containing a given pattern, not about the string being equal to the pattern!
In more complex regex, you should put choices between parentheses; sometimes this is needed to define a subexpression. We could, for example, define the regex allu|aly|ali using a subexpression as al(lu|y|i), matching "al" plus either "lu", "y", or "i".
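The "contains" semantics and the equivalence of the two notations can be checked with a small test program. This is only a sketch; the helper function Matches is a hypothetical name introduced here, not part of the RegExpr unit.

```pascal
program regexchoice;
{$mode objfpc}{$H+}
uses
  RegExpr;

// Hypothetical helper: True if S contains a match for Pattern
function Matches(const Pattern, S: string): Boolean;
var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    Re.Expression := Pattern;
    Result := Re.Exec(S);
  finally
    Re.Free;
  end;
end;

begin
  Writeln(Matches('allu|aly|ali', 'ali'));    // TRUE
  Writeln(Matches('allu|aly|ali', 'alli'));   // FALSE
  Writeln(Matches('allu|aly|ali', 'alien'));  // TRUE: "alien" contains "ali"
  Writeln(Matches('al(lu|y|i)', 'alien'));    // TRUE: same result with the subexpression
  Writeln(Matches('al(lu|y|i)', 'alli'));     // FALSE
end.
```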
Regex use escape codes for non-printable characters, as in C; examples are \n for linefeed and \r for carriage return. The backslash (\) is also used to define so-called meta-classes: predefined character classes that keep a regex compact. One example is \s, which matches any whitespace character (space, tab, linefeed, carriage return, formfeed). Probably the most often used one is \d, which denotes any digit (0 to 9); it is the same as the character class [0-9], which we could also write as [0123456789]. Character classes, enclosed in square brackets, match any single one of the characters specified. Examples: [a-z] matches any lowercase letter from a to z, [iueoa] matches any lowercase vowel, [\+\-\*\/] matches any of the four basic arithmetic operators. Outside a character class, + and * have a special regex meaning and must be escaped with a backslash; inside the brackets only the hyphen (which denotes a range) really needs it, but escaping the other symbols is harmless.
Note that a caret (^) placed immediately after the opening bracket inverts the meaning of the class; thus, [^iueoa] matches any character except these vowels (which does not mean that the string checked must not contain any vowels, only that it must contain at least one character that is not a vowel!).
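The following sketch tries out some of the meta-classes and character classes described above. The helper function Matches is a hypothetical name introduced for this illustration.

```pascal
program regexclasses;
{$mode objfpc}{$H+}
uses
  RegExpr;

// Hypothetical helper: True if S contains a match for Pattern
function Matches(const Pattern, S: string): Boolean;
var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    Re.Expression := Pattern;
    Result := Re.Exec(S);
  finally
    Re.Free;
  end;
end;

begin
  Writeln(Matches('\d', 'room 7'));         // TRUE: the string contains a digit
  Writeln(Matches('\d', 'room'));           // FALSE
  Writeln(Matches('[iueoa]', 'rhythm'));    // FALSE: no vowel in the string
  Writeln(Matches('[^iueoa]', 'aua'));      // FALSE: all characters are vowels
  Writeln(Matches('[^iueoa]', 'any'));      // TRUE: "n" and "y" are not in the class
  Writeln(Matches('[\+\-\*\/]', '3*4'));    // TRUE: the string contains an operator
  Writeln(Matches('[\+\-\*\/]', '34'));     // FALSE
end.
```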
An important special regex character is the dot character . that denotes "any character". Example: al. matches "aly" and "ali", but also "ball" and "balcony"; it does not match "bacon", nor does it match "bal" (because the dot indicates that there has to be another character after "al").
Character matches concern the match of a single character. The number of characters to be matched can be modified by using a quantifier. Quantifiers specify the number of repetitions of a character or a group of characters (defined as a subexpression). The three important quantifiers are:
- * means zero or more times (any number of repetitions).
- + means one or more times.
- ? means zero or one time.
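Examples: If in our example above we use the regex al.*, it will also match "bal", as the dot may now occur zero times. The same is true for al.?, which matches "aly", "ali", "bal", and "ball"; it matches "balcony" as well, because this string still contains "alc" (the quantifier limits the length of the text matched, not the length of the string searched). More serious examples: \d+ may be used to search for a positive integer, -?\d+ to search for any integer (positive or negative).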
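To see which part of the string a pattern actually matches, we can print the whole match, available as Match[0]. The following is a sketch only; the helper function FirstMatch and the test strings are hypothetical names and values chosen for this illustration.

```pascal
program regexquant;
{$mode objfpc}{$H+}
uses
  RegExpr;

// Hypothetical helper: returns the text of the first match of Pattern in S
function FirstMatch(const Pattern, S: string): string;
var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    Re.Expression := Pattern;
    if Re.Exec(S) then
      Result := Re.Match[0]      // index 0 is the whole match
    else
      Result := '<no match>';
  finally
    Re.Free;
  end;
end;

begin
  Writeln(FirstMatch('al.', 'balcony'));          // alc
  Writeln(FirstMatch('al.', 'bal'));              // <no match>
  Writeln(FirstMatch('al.*', 'bal'));             // al
  Writeln(FirstMatch('al.*', 'balcony'));         // alcony
  Writeln(FirstMatch('al.?', 'bal'));             // al
  Writeln(FirstMatch('al.?', 'balcony'));         // alc
  Writeln(FirstMatch('\d+', 'from -12 to 7'));    // 12
  Writeln(FirstMatch('-?\d+', 'from -12 to 7'));  // -12
end.
```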
Pattern matching is about strings containing a given pattern, at any position within the string. We can create a regex that only matches at the beginning or at the end of the string by using the special regex characters ^ (must be the first character of the regex) and $ (must be the last character of the regex), respectively. Example: the regex ^(allu|aly) will match the strings "allu wrote a great regex tutorial" and "aly wrote another great tutorial", but not "therefore allu wrote another tutorial".
These two special characters are particularly important because they allow us to check for a "string equals pattern" match (instead of a "string contains pattern" match). With a regex starting with ^ and ending with $, pattern matching is about the whole string corresponding to a given pattern. Examples: ^[iueoa]$ matches a single character that must be a vowel, ^[iueoa]+$ matches a string with one or more characters that are all vowels. And the important ones: ^\d+$ lets us check if a string contains only digits (is a positive integer); ^-?\d+$ lets us check if a string is a (positive or negative) integer.
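The effect of the two anchors can be verified with a few test strings. As before, this is a sketch and the helper function Matches is a hypothetical name.

```pascal
program regexanchors;
{$mode objfpc}{$H+}
uses
  RegExpr;

// Hypothetical helper: True if S contains a match for Pattern
function Matches(const Pattern, S: string): Boolean;
var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    Re.Expression := Pattern;
    Result := Re.Exec(S);
  finally
    Re.Free;
  end;
end;

begin
  Writeln(Matches('^(allu|aly)', 'allu wrote a great regex tutorial'));     // TRUE
  Writeln(Matches('^(allu|aly)', 'therefore allu wrote another tutorial')); // FALSE
  Writeln(Matches('^[iueoa]+$', 'aua'));   // TRUE: all characters are vowels
  Writeln(Matches('^[iueoa]+$', 'any'));   // FALSE
  Writeln(Matches('^\d+$', '123'));        // TRUE
  Writeln(Matches('^\d+$', '12a'));        // FALSE
  Writeln(Matches('^-?\d+$', '-123'));     // TRUE
  Writeln(Matches('^-?\d+$', '123-'));     // FALSE
  Writeln(Matches('^-?\d+$', '--123'));    // FALSE: at most one minus sign
end.
```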
A last feature before passing to some example programs. A letter-based match can be checked with or without case sensitivity; with the default settings of the RegExpr unit, matching is case-sensitive. The article mentioned above describes some ways to change the default. Another possibility is to specify case sensitivity or case insensitivity within the regex using an inline modifier: (?i) turns case insensitivity on, (?-i) turns it off. Examples (from the article mentioned above): (?i)Saint-Petersburg matches both "Saint-Petersburg" and "Saint-petersburg"; (?i)Saint-(?-i)Petersburg matches "Saint-Petersburg" but not "Saint-petersburg". Also note that in the regex ((?i)Saint)-Petersburg the modifier only affects the group ((?i)Saint); this regex therefore matches "saint-Petersburg" but not "saint-petersburg".
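A short sketch to try out the inline modifiers. The helper function Matches is a hypothetical name, and the result of the first test assumes that the unit's default modifier settings have not been changed.

```pascal
program regexcase;
{$mode objfpc}{$H+}
uses
  RegExpr;

// Hypothetical helper: True if S contains a match for Pattern
function Matches(const Pattern, S: string): Boolean;
var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    Re.Expression := Pattern;
    Result := Re.Exec(S);
  finally
    Re.Free;
  end;
end;

begin
  // No modifier: FALSE, assuming the default (case-sensitive) settings
  Writeln(Matches('saint', 'Saint-Petersburg'));
  // (?i) makes the whole regex case-insensitive: TRUE
  Writeln(Matches('(?i)Saint-Petersburg', 'saint-petersburg'));
  // (?-i) switches case sensitivity back on for "Petersburg"
  Writeln(Matches('(?i)Saint-(?-i)Petersburg', 'saint-Petersburg'));  // TRUE
  Writeln(Matches('(?i)Saint-(?-i)Petersburg', 'Saint-petersburg'));  // FALSE
end.
```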
Program sample 1: Checking if a string is numeric.
The program sample regex1 checks if a string is an integer (and if so, if it is a positive integer), a floating point number, or not numeric. Positive integers are supposed to consist of digits only, integers of digits optionally preceded by a minus sign. Floating point numbers are supposed to have at least one digit before and after the decimal separator, which has to be determined for the actual system locale when creating the regular expression. Here is the code:
program regex1;
{$mode objfpc}{$H+}
uses
  SysUtils, RegExpr;
const
  Strings: array[0..9] of string = (
    '123', '-123', '123-', '--123', '12.3', '12,3', '1,2,3', '-12.3', '-12,3', '-123,'
  );
var
  I: Integer;
  IsInteger, IsPosInteger, IsFloat: Boolean;
  Regex: TRegExpr;

// Determine the decimal separator used by the current locale
function DecimalSeparator: Char;
var
  S: string;
begin
  S := FloatToStr(1.2);
  Result := S[2];
end;

begin
  Regex := TRegExpr.Create;
  Writeln; Writeln('Test if an expression is numeric.');
  for I := 0 to 9 do begin
    // Reset the flags that are not set in every iteration
    IsPosInteger := False; IsFloat := False;
    Regex.Expression := '^-?\d+$';
    IsInteger := Regex.Exec(Strings[I]);
    if IsInteger then begin
      Regex.Expression := '^\d+$';
      IsPosInteger := Regex.Exec(Strings[I]);
    end
    else begin
      // The separator is escaped with a backslash: if it is a dot,
      // it would otherwise match any character (e.g. the comma in '12,3')
      Regex.Expression := '^-?\d+\' + DecimalSeparator + '\d+$';
      IsFloat := Regex.Exec(Strings[I]);
    end;
    Write(Strings[I]:7);
    if IsInteger or IsFloat then begin
      Write(' is a number; ');
      if IsInteger then begin
        if IsPosInteger then
          Writeln('it is a positive integer.')
        else
          Writeln('it is a (positive or negative) integer.')
      end
      else begin
        Writeln('it is a floating point number.');
      end;
    end
    else begin
      Writeln(' is NOT a valid number!');
    end;
  end;
  Regex.Free;
end.
The screenshot shows the program output.
Program sample 2: Checking if a string contains vowels.
The program sample regex2 checks if a string contains vowels (and if so, if it is all vowels, or if the first and/or last character is a vowel, as well as if among the vowels there are uppercase ones). Here is the code:
program regex2;
{$mode objfpc}{$H+}
uses
  SysUtils, RegExpr;
const
  Words: array[0..8] of string = (
    'Any', 'any', 'many', 'NY', 'Ali', 'Bali', 'aua', 'AUA', 'Aua'
  );
var
  I: Integer;
  Contains, Upper, First, Last, All, AllUpper: Boolean;
  Regex: TRegExpr;
begin
  Regex := TRegExpr.Create;
  Writeln; Writeln('Test if an expression contains vowels.');
  for I := 0 to 8 do begin
    // Reset the flags, as most of them are only set if a vowel has been found
    Upper := False; First := False; Last := False; All := False; AllUpper := False;
    Regex.Expression := '(?i)[iueoa]';
    Contains := Regex.Exec(Words[I]);
    if Contains then begin
      Regex.Expression := '^(?i)[iueoa]';
      First := Regex.Exec(Words[I]);
      Regex.Expression := '(?i)[iueoa]$';
      Last := Regex.Exec(Words[I]);
      Regex.Expression := '^(?i)[iueoa]+$';
      All := Regex.Exec(Words[I]);
      // No modifier here: only uppercase vowels are matched
      Regex.Expression := '[IUEOA]';
      Upper := Regex.Exec(Words[I]);
      if Upper then begin
        Regex.Expression := '^[IUEOA]+$';
        AllUpper := Regex.Exec(Words[I]);
      end;
    end;
    Write(Words[I]:6);
    if Contains then begin
      if All then begin
        if AllUpper then
          Writeln(' contains exclusively uppercase vowels.')
        else if Upper then
          Writeln(' contains exclusively (uppercase and lowercase) vowels.')
        else
          Writeln(' contains exclusively lowercase vowels.');
      end
      else begin
        if Upper then
          Write(' contains one or more (uppercase or lowercase) vowels; ')
        else
          Write(' contains one or more (lowercase) vowels; ');
        if First and Last then
          Writeln('the first and last letter are vowels.')
        else if First then
          Writeln('the first letter is a vowel.')
        else if Last then
          Writeln('the last letter is a vowel.')
        else
          Writeln('neither the first, nor the last letter are vowels.');
      end;
    end
    else begin
      Writeln(' does NOT contain any vowels.');
    end;
  end;
  Regex.Free;
end.
The screenshot shows the program output.
String parsing and multiple match occurrences.
Regular expressions may not only be used to check if a string contains a given pattern (or is equal to a given pattern), but are also useful to find multiple occurrences of a given pattern. Example: Find all integer numbers in a string and extract them to Integer variables.
Performing a multiple occurrences pattern matching with extraction of the actual values in Free Pascal requires the following steps:
- Create a TRegExpr object, using the TRegExpr.Create method.
- Assign the regular expression to be used to the property TRegExpr.Expression. Use parentheses to define a subexpression for each pattern to be used to extract a value.
- Call the method TRegExpr.Exec to search for the first match; if it returns True, use the property TRegExpr.Match[<index>] to extract the first value matching the pattern. The index used here corresponds to the position of the subexpression group; groups are numbered from left to right, starting with 1 (index 0 denotes the whole match).
- To extract the following values matching the same pattern, use the method TRegExpr.ExecNext. Extract the actual value as before, and continue doing so as long as there is a match.
- When finished, call the method TRegExpr.Free to free the regular expression resources used.
Example (modified from the RegEx packages article in the Free Pascal wiki): Suppose that our TRegExpr object is called "Re" and that we want to extract all names from a string that contains multiple "hello <name>!" substrings. If the string "S" is, for example, 'hello Aly! hello Pascal!', to successively extract 'Aly', then 'Pascal', we can use regular expressions as follows:
Re := TRegExpr.Create('hello (.*?)!');
if Re.Exec(S) then begin
  Writeln(Re.Match[1]);
  while Re.ExecNext do
    Writeln(Re.Match[1]);
end;
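The important point here is the subexpression defined with the help of the parentheses. It is the text matched by the pattern .*? that will be extracted with Re.Match[1] (the index to be used is 1, as the subexpression is the only, and thus the first, group defined). Note the question mark following the asterisk: it makes the quantifier non-greedy, so that .*? matches as few characters as possible and stops at the first exclamation mark found. With the greedy .* the first match would extend to the last exclamation mark of the string, and the "name" extracted would be 'Aly! hello Pascal'.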
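A complete program for this extraction could look like the following sketch. The procedure PrintNames is a hypothetical helper introduced for this illustration; it also prints the whole match (index 0) and is called a second time with a greedy pattern for comparison.

```pascal
program regexextract;
{$mode objfpc}{$H+}
uses
  RegExpr;

// Hypothetical helper: prints the whole match and group 1
// for every occurrence of Pattern in S
procedure PrintNames(const Pattern, S: string);
var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create(Pattern);
  try
    if Re.Exec(S) then
      repeat
        Writeln('  whole match: "', Re.Match[0], '", group 1: "', Re.Match[1], '"');
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end;

const
  S = 'hello Aly! hello Pascal!';
begin
  Writeln('Non-greedy pattern:');
  PrintNames('hello (.*?)!', S);   // two matches: "Aly", then "Pascal"
  Writeln('Greedy pattern:');
  PrintNames('hello (.*)!', S);    // one match: "Aly! hello Pascal"
end.
```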
Program sample 3: A simple arithmetic expression parser.
The program sample regex3 illustrates the extraction of multiple match occurrences. The user is asked for an arithmetic expression (to be evaluated by the program), supposed to be of the form <number><operator><number>[<operator><number>...], where number is a positive integer and operator is one of the four basic arithmetic operators +, -, *, or / (an integer division being performed in this last case). Here is the code:
program regex3;
{$mode objfpc}{$H+}
uses
  SysUtils, RegExpr;
var
  N: Integer;
  Expr, Mess: string;
  Regex: TRegExpr;

// Apply the operator Op to N1 and N2 and store the result in N1;
// Mess is set to an error message in the case of a division by zero
procedure Calculate(Op: string; var N1: Integer; N2: Integer; out Mess: string);
begin
  Mess := '';
  case Op[1] of
    '+': N1 := N1 + N2;
    '-': N1 := N1 - N2;
    '*': N1 := N1 * N2;
    '/': begin
      if N2 = 0 then
        Mess := 'Division by zero'
      else
        N1 := N1 div N2;
    end;
  end;
end;

begin
  Regex := TRegExpr.Create;
  Writeln; Writeln('Parse an arithmetic expression (positive integers).'); Writeln;
  repeat
    Write('Enter expression (ENTER to terminate)? '); Readln(Expr);
    // Check the format of the whole expression; group 1 is the first operand
    Regex.Expression := '^(\d+)([\+\-\*\/]\d+)+$';
    if Regex.Exec(Expr) then begin
      N := StrToInt(Regex.Match[1]);
      // Extract all (operator, operand) pairs, one after the other
      Regex.Expression := '([\+\-\*\/])(\d+)';
      if Regex.Exec(Expr) then begin
        Calculate(Regex.Match[1], N, StrToInt(Regex.Match[2]), Mess);
        while (Mess = '') and Regex.ExecNext do
          Calculate(Regex.Match[1], N, StrToInt(Regex.Match[2]), Mess);
      end;
      if Mess = '' then
        Writeln('  Result of arithmetic expression is ', N)
      else
        Writeln('  Input error: ', Mess, '!');
    end
    else begin
      if Expr <> '' then
        Writeln('  This is not a positive integer arithmetic expression!');
    end;
    Writeln;
  until Expr = '';
  Regex.Free;
end.
The program asks for an arithmetic expression until the user enters an empty string (just hits ENTER). We first check if the string actually is an arithmetic expression (defined by the format described above). We then extract the first operand using the first subexpression (group 1) of the initial regex. The first operand having been extracted, all subsequent matches will include an operator and a further operand. The corresponding regex is ([\+\-\*\/])(\d+). Note the definition of two subexpressions, using two pairs of parentheses. The first one is for the extraction of the operator (we'll get it with Regex.Match[1]); the second one is for the operand (we'll get it with Regex.Match[2]). We now have all that we need to perform the arithmetic calculation with the two numbers N1 (the running result, passed to the procedure in the variable N) and N2. Storing the result in N1, we continue the loop, extracting a further operator and a further operand N2, which we add to, subtract from, multiply with, or divide into N1. Continuing this until there isn't a match anymore (or until there was a division by zero error), we'll get the result of the arithmetic expression in N1 and can print it out (unless there was an error). Note that the operations are performed strictly from left to right, without any operator precedence: the expression 2+3*4 evaluates to 20, not to 14.
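For testing without keyboard input, the same parsing logic can be wrapped into a function. The following is a sketch under that assumption; the function Evaluate and the test expressions are hypothetical and not part of the original program.

```pascal
program regexeval;
{$mode objfpc}{$H+}
uses
  SysUtils, RegExpr;

// Hypothetical helper: evaluates Expr strictly from left to right.
// Returns False if Expr has not the expected format or divides by zero.
function Evaluate(const Expr: string; out Value: Integer): Boolean;
var
  Re: TRegExpr;
  Op: string;
  Operand: Integer;
begin
  Result := False;
  Value := 0;
  Re := TRegExpr.Create;
  try
    Re.Expression := '^(\d+)([\+\-\*\/]\d+)+$';
    if not Re.Exec(Expr) then
      Exit;                               // not a valid expression
    Value := StrToInt(Re.Match[1]);       // first operand
    Re.Expression := '([\+\-\*\/])(\d+)';
    if not Re.Exec(Expr) then
      Exit;
    repeat
      Op := Re.Match[1];
      Operand := StrToInt(Re.Match[2]);
      case Op[1] of
        '+': Value := Value + Operand;
        '-': Value := Value - Operand;
        '*': Value := Value * Operand;
        '/': begin
          if Operand = 0 then
            Exit;                         // division by zero
          Value := Value div Operand;
        end;
      end;
    until not Re.ExecNext;
    Result := True;
  finally
    Re.Free;
  end;
end;

var
  V: Integer;
begin
  if Evaluate('2+3*4', V) then
    Writeln('2+3*4 = ', V);               // 20 (left to right), not 14
  if Evaluate('7-10', V) then
    Writeln('7-10 = ', V);                // -3
  if not Evaluate('10/0', V) then
    Writeln('10/0 cannot be evaluated');
  if not Evaluate('2+x', V) then
    Writeln('2+x is not a valid expression');
end.
```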
The screenshot shows an example of the program output.