Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Linux Magazine Column 27 (Aug 2001)
[suggested title: Understanding Regular Expressions]
One of the things that distinguishes Perl as a Powerful Practical tool in the Linux toolbox is its ability to wrangle text in interesting ways that make it seem effortless. And a majority of that ability can be attributed to Perl's very powerful regular expressions.
Regular expressions are nothing new. I was using Unix tools in 1977 with regular expressions, and I suspect they go back even further than that. But Perl continues to push the edge of how regular expressions work; so much so that there's a separate ``Perl Compatible Regular Expressions'' library (PCRE) so that other tools can catch up to Perl!
But before we get to the advanced stuff, let's quickly review the basics. A regular expression defines a template to match against a string. We say the string either matches, or doesn't match, the given regular expression, based on whether the string has properties that the regular expression demands.
For example, the regular expression /a/ demands that there be the
letter a somewhere in the string. If there's a choice, and it matters,
it matches the leftmost ``a'' (this is called the ``leftmost rule'').
(We typically write the regular expression enclosed in forward
slashes, because that's the most common use in Perl, although there
are others.)
Of the ``atoms'' in a regular expression (from which the regular
expression is built), the most common are the ordinary characters
(such as the letter a above), including any special characters
preceded by a backslash. And nearly everything that works in a
double-quoted string also works in a regular expression, so \n is a
newline, and \001 is a control-A, and so on.
But atoms also include character classes, such as [aeiou], looking
for any single vowel, and [^qwertyuiop], looking for any single
character not found on the top alphabetic row of the keyboard, and
[a-z], indicating any one of the lowercase letters. Some character
classes have abbreviations, such as \s for whitespace (and its
companion \S for non-whitespace). And the very common . means match
any single character except a newline character.
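To make those atoms concrete, here is a minimal sketch that tries a few of them against sample strings. The strings and variable names are my own picks for illustration, nothing more.

```perl
use strict;
use warnings;

# Sample strings and variable names are illustrative assumptions.
# Each test yields 1 on a match and 0 otherwise.
my $vowel_in_rhythm = "rhythm"     =~ /[aeiou]/       ? 1 : 0;  # no vowels at all
my $off_top_row     = "typewriter" =~ /[^qwertyuiop]/ ? 1 : 0;  # every letter is on the top row
my $has_whitespace  = "two words"  =~ /\s/            ? 1 : 0;  # the space counts
my $dot_on_newline  = "\n"         =~ /./             ? 1 : 0;  # . will not take a newline
my $dot_on_letter   = "x"          =~ /./             ? 1 : 0;  # but takes anything else

print "$vowel_in_rhythm $off_top_row $has_whitespace $dot_on_newline $dot_on_letter\n";
```

Only the whitespace test and the last dot test succeed here; the other three strings have nothing the atom can latch onto.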
We can follow any of these atoms by a repetition operator, such as *
(zero or more) or + (one or more) or ? (zero or one). Given the
opportunity, these repetition operators match as much of the string
as possible while still letting the rest of the regular expression
match. We'll see an example or two of that later.
There's also the generalized repetition range operator, {m,n}, which
looks for m through n instances of the atom it follows, and the
modified forms of {m,} for ``m or more'' and {n} for ``exactly n''.
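As a rough sketch of how the repetition operators behave, the following captures what each one grabs from a run of the letter a. The sample strings are assumptions chosen so the differences show up.

```perl
use strict;
use warnings;

# A capture in list context returns what the parentheses matched.
# The sample strings are illustrative assumptions.
my ($star)  = "aaaa"  =~ /(a*)/;      # zero or more: takes all four
my ($plus)  = "baaad" =~ /(a+)/;      # one or more: the run of three
my ($maybe) = "aaaa"  =~ /(a?)/;      # zero or one: just the first
my ($range) = "aaaa"  =~ /(a{2,3})/;  # two through three: takes three
my ($open)  = "aaaa"  =~ /(a{2,})/;   # two or more: takes all four
my ($exact) = "aaaa"  =~ /(a{2})/;    # exactly two

print join(",", $star, $plus, $maybe, $range, $open, $exact), "\n";
```

Each operator takes as many as its range allows, which is why {2,3} ends up with three and not two.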
Atoms, with or without a trailing repetition operator, are frequently
connected into sequences. The regular expression /pqr/ means look for
a p immediately followed by a q and then an r. So /d[aiou]g/ looks
for dag, dig, dog or dug. And /fre*ed/ will match freeeed or freed or
even fred, because the repetition operator will ``back off'' after
first matching the final ``e'' so that the non-repeated ``e'' gets a
chance to match. We'd typically write that last one as /fre+d/
though.
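Here is a small sketch that checks the two spellings against the same words. The word list is my own, with ``frd'' thrown in as a deliberate non-match.

```perl
use strict;
use warnings;

# Candidate words are illustrative; "frd" is included as a non-match.
my (%star_form, %plus_form);
for my $word (qw(freeeed freed fred frd)) {
    $star_form{$word} = $word =~ /fre*ed/ ? 1 : 0;  # e* must give back one "e"
    $plus_form{$word} = $word =~ /fre+d/  ? 1 : 0;  # same language, no backing off needed
}

for my $word (qw(freeeed freed fred frd)) {
    printf "%-8s fre*ed=%d fre+d=%d\n", $word, $star_form{$word}, $plus_form{$word};
}
```

Both patterns agree on every word in the list, accepting the first three and rejecting ``frd''.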
A sequence can also contain ``assertions''. The two most common
assertions are ``beginning of string'' as ^ or ``end of string'' as $.
So /small$/ matches the word ``small'' only when it appears at the end
of the string, failing to match smalltalk.
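A quick sketch of the two assertions at work, using a handful of strings I made up for the purpose:

```perl
use strict;
use warnings;

# Strings are illustrative assumptions; 1 means the pattern matched.
my $end_ok     = "too small" =~ /small$/  ? 1 : 0;  # "small" sits at the end
my $end_fail   = "smalltalk" =~ /small$/  ? 1 : 0;  # "talk" follows, so no match
my $start_ok   = "smalltalk" =~ /^small/  ? 1 : 0;  # "small" sits at the start
my $start_fail = "too small" =~ /^small/  ? 1 : 0;  # "too " comes first
my $whole      = "small"     =~ /^small$/ ? 1 : 0;  # both ends pinned down

print "$end_ok $end_fail $start_ok $start_fail $whole\n";
```

Using both assertions together, as in the last line, demands that the pattern account for the entire string.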
Sequences can be alternated with the | operator, as in /fred|barney/
matching either of those names, or the names ``manfred'' or
``redbarney'', since we can still match any substring and be valid.
And finally, a regular expression enclosed in parentheses (called a
subexpression) can be considered an atom again, starting us over at
the top. So to apply a repetition to an alternation of two sequences,
we can say /(fred|barney)+/, which means at least one fred or barney,
such as ``fred'' or ``barney'' or ``barneyfred'' or
``fredfredfredbarneyfred''.
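To see why the parentheses matter, here is a sketch that pins the pattern to the whole string with ^ and $ (my addition, so that substring matches don't hide the difference). The test strings are likewise assumptions.

```perl
use strict;
use warnings;

# Test strings are illustrative assumptions; 1 means the pattern matched.
my $substring  = "manfred" =~ /fred|barney/ ? 1 : 0;                       # matches inside "manfred"
my $mixed_run  = "fredfredfredbarneyfred" =~ /^(fred|barney)+$/ ? 1 : 0;   # all freds and barneys
my $broken_run = "fredbarn" =~ /^(fred|barney)+$/ ? 1 : 0;                 # "barn" is left over

# Without parentheses, | splits the whole pattern in two:
# either "^fred" or "barney+$", so the + applies only to the final "y".
my $no_parens  = "fredbarn" =~ /^fred|barney+$/ ? 1 : 0;                   # "^fred" alone succeeds

print "$substring $mixed_run $broken_run $no_parens\n";
```

The last test is the trap: it looks like the parenthesized version, but the alternation applies to everything on each side of the bar.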
Parentheses also serve to indicate memory. As the regular expression is being matched against the string, the part of the string matching a subexpression (enclosed in parentheses) ends up in a memory. The first subexpression yields the first memory, the second subexpression yields the second memory, and so on.
We have two ways of accessing the memories. First, when the smoke
clears, the contents of the first subexpression are in the read-only
Perl variable called $1, with the second in $2 and so on. So after we
match /abc(.*)def/ against ``abcGHIdef'', we'll have GHI in $1, until
the next successful match.
The other way of accessing the memory (which is used less frequently,
but can still come in handy) is the backreference, where \1 is an atom
denoting the first memory already saved earlier within the same
regular expression. So /(['"])(.*)\1/ matches a single- or
double-quoted string, with $2 being the contents between the quotes,
and yet the quote marks have to be the same type of quotes. Rare, but
cool when you need it.
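A minimal sketch of that backreference in action follows. The three input strings are invented for illustration, the last one with deliberately mismatched quotes.

```perl
use strict;
use warnings;

# Input strings are illustrative assumptions.
my @inputs = (
    q{say "hello" now},   # matching double quotes
    q{say 'hi' now},      # matching single quotes
    q{say "oops' now},    # mismatched quotes
);

my @contents;
for my $input (@inputs) {
    if ($input =~ /(['"])(.*)\1/) {
        push @contents, $2;            # text between the paired quotes
    } else {
        push @contents, "(no match)";  # \1 never found its partner
    }
}

print join(" | ", @contents), "\n";
```

The mismatched string fails because \1 insists on the same quote character that the first memory captured.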
Occasionally, we'll need a set of parentheses that does not trigger a
memory. For that, we can use ?: just inside the open parenthesis. As
an example, /([0-9]+(?:\.[0-9]+)?)\s+([a-z]+)/ means an integer or
floating point value followed by some whitespace and a lowercase word.
The number will be in $1 and the word will be in $2, even though we
had to use a third set of parentheses to make the fractional portion
of the number optional.
Those are the basics, and will get you through about 90% of what you need for regular expressions. Let's look at how regular expressions are used now.
Perl's regular expression operators include the match, the
substitute, and split. (I say ``include'' because I can't think of any
others, but I'm trying to be accurate.) A scalar match is the most
common:
    print "What small integer? ";
    chomp($_ = <STDIN>);
    if (/^(\d+)$/) {
      print "Good, you said $1!\n";
    }
Note here that the match is against the contents of $_ by default,
but we can refer to any other value with the =~ operator. This code
does the same thing as the code above:
    print "What small integer? ";
    if (<STDIN> =~ /^(\d+)$/) {
      print "Good, you said $1!\n";
    }
Note that we didn't even need the chomp here: an often-misunderstood
property of the $ assertion is that it matches either right at the end
of the string, or just before a newline at the end of the string. As
this can lead to security holes, I'm now starting to include \z
instead of $ more often in my programs, which says ``I absolutely want
the end of the string here for this to match''.
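The difference is easy to see with a string that still carries its trailing newline. This is only a sketch, with a hard-coded string standing in for a line read from STDIN:

```perl
use strict;
use warnings;

# "42\n" stands in for a line read from STDIN (an assumption for this sketch).
my $input = "42\n";

my $dollar_raw = $input =~ /^\d+$/  ? 1 : 0;  # the $ assertion tolerates the trailing newline
my $z_raw      = $input =~ /^\d+\z/ ? 1 : 0;  # \z insists on the true end of string

chomp(my $clean = $input);                    # strip the newline first
my $z_clean    = $clean =~ /^\d+\z/ ? 1 : 0;  # now \z is satisfied

print "$dollar_raw $z_raw $z_clean\n";
```

So with \z, the chomp is no longer optional, which is the point: nothing sneaks in after the digits.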
The substitute operator replaces a portion of a string (by default in
$_, but you can change that using =~) with another double-quoted
string on a successful match.
    $_ = "hello, world";
    s/hello/Hello/;          # "Hello, world"
    s/(Hello, )/${1}Perl /;  # "Hello, Perl world"
    s/(.*), (.*)/$2, $1!/;   # "Perl world, Hello!"
Both the match and the substitute operator can use alternate delimiters (nearly any other punctuation character) if the forward slash is troublesome, particularly when the regular expression or replacement contains forward slashes. (A match with alternate delimiters needs an explicit leading m, as in m!...!.) For example:
    my $filename = "/home/merlyn/.newsrc";
    my $basename = $filename;
    $basename =~ s!.*/!!; # $basename now ".newsrc"
Note the use of ! as the delimiter here. We're guaranteed to match
down to the final slash, because the ``.*'' matches zero or more of
(nearly) any character, taking the longest possible match that still
lets the rest of the regular expression match.
We successfully extracted the basename from that particular filename, but we blew it in the general case. Why? Because a newline character is valid in a Unix pathname. We need to match any possible characters before the final slash. We can do that with a character range:
$basename =~ s![\000-\377]*/!!;
or more simply by tagging an s modifier onto the substitute:
$basename =~ s!.*/!!s;
The s modifier changes . so that it matches newlines as well, and we
now get any possible character there. Another useful modifier is the
case-insensitive modifier of i. For example, /[aeiou]/i finds any
vowel in upper or lower case. Note that you can also write that as
/a|e|i|o|u/i, but the character class version will be considerably
faster.
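Here is a sketch of the newline problem and the s modifier fix, using a made-up pathname with a newline buried in a directory name, plus a one-line check of the i modifier:

```perl
use strict;
use warnings;

# Hypothetical path whose directory name contains a newline.
my $path = "/tmp/odd\ndir/file.txt";

(my $without_s = $path) =~ s!.*/!!;   # . stops at the newline, so only "/tmp/" goes
(my $with_s    = $path) =~ s!.*/!!s;  # . crosses the newline, reaching the last slash

my $vowel_any_case = "RHYTHM AND BLUES" =~ /[aeiou]/i ? 1 : 0;  # finds the "A"

(my $shown = $without_s) =~ s/\n/\\n/g;  # make the newline visible when printing
print "without s: $shown\n";
print "with s:    $with_s\n";
print "vowel:     $vowel_any_case\n";
```

Without the modifier we are left with most of the path still attached, which is exactly the general-case failure described above.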
What if we had wanted to find the nearest slash instead of the
furthest slash? The easiest way is to tell the * repetition operator
to be ``lazy'' instead of ``greedy''. Placing a ? immediately after a
repetition operator tells it to take as few matches of that atom as
possible, instead of the greatest number. For example:
    my $filename = "/home/bob/summary";
    my $one = my $two = $filename;
    $one =~ s!/.*/!/etc/!;  # "/etc/summary"
    $two =~ s!/.*?/!/etc/!; # "/etc/bob/summary"
For $one, we grabbed the first slash, as many characters as we could,
and then the final (third) slash.
But for $two, we grabbed the first slash, as few characters as we
could, and then the next immediate slash (the second slash). Note
that this didn't find the ``shortest overall match'' as some people
have claimed incorrectly (which would have been ``/bob/'' rather than
``/home/''). It still starts with the first slash. This is similar to
how /([ab]+)/ will match the a's in ___aa___bbb___, rather than the
(longer) sequence of b's. It's ``leftmost match first'' and then the
repetitions individually have biases towards ``longer matches'' (the
default) or ``shorter matches'' from that starting point.
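A short sketch that captures what each form actually matched, using the same strings as above (the variable names are mine):

```perl
use strict;
use warnings;

# Leftmost wins, even though the run of b's is longer.
my ($first_run) = "___aa___bbb___" =~ /([ab]+)/;

# Same starting slash, different appetite.
my ($greedy) = "/home/bob/summary" =~ m!(/.*/)!;   # as much as possible
my ($lazy)   = "/home/bob/summary" =~ m!(/.*?/)!;  # as little as possible

print "$first_run $greedy $lazy\n";
```

Both slash patterns begin at the first slash; the lazy form simply stops at the earliest point that lets the match finish.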
The split operator uses its regular expression to define a
``delimiter'', which is then found (usually multiple times) in a
string. Each match is discarded (leading one of my friends to call it
the ``deliminator''), leaving us with the pieces of string left as the
list return value. So, a typical /etc/passwd-style file is parsed with
relative ease:
    my $line = "merlyn:x:904:100:Randal L. Schwartz:/home/merlyn:/bin/perl\n";
    chomp $line;
    my @values = split /:/, $line;
Now @values has seven elements, corresponding to the seven items
between the delimiters. If two colons were in a row, we'd get an empty
element in the list:
my @values = split /:/, "merlyn2::905:100::/home/merlyn2:/bin/perl";
Here, the second and fifth elements of @values are empty. Had we
instead used /:+/ for the delimiter expression, each pair of
consecutive colons would have been considered one big fat delimiter,
and we'd have gotten five return values instead of seven.
This is typically desired when we are using whitespace as the
delimiter: we'll use /\s+/
for the expression, because generally a
hunk of whitespace in a row is a big fat delimiter, not many small
omitted items.
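The element counts can be checked with a quick sketch. The colon-separated record is the one from above; the whitespace-separated string is an assumption of mine.

```perl
use strict;
use warnings;

my $record = "merlyn2::905:100::/home/merlyn2:/bin/perl";
my @each_colon = split /:/,  $record;  # empty fields are preserved
my @colon_runs = split /:+/, $record;  # each run of colons is one delimiter

# Illustrative text with runs of spaces and a tab.
my $text = "alpha   beta\tgamma  delta";
my @each_space = split / /,   $text;   # one space at a time: empties appear
my @space_runs = split /\s+/, $text;   # any hunk of whitespace is one delimiter

print scalar(@each_colon), " ", scalar(@colon_runs), " ",
      scalar(@each_space), " ", scalar(@space_runs), "\n";
```

For the colon-separated record the empty fields carry meaning, so /:/ is the right choice there; for free-form text the /\s+/ version is usually what we want.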
Sometimes, it's easier to specify what we keep instead of what we
throw away. For example, suppose I want to keep any integer or
floating point values in a line, discarding anything else that doesn't
look like a number. For that, we can use a match with a g modifier
(for global) in a list context, which contributes $1 to a list result
for each match:
    $_ = '12.24 dollars for 35 fish? Are you crazy?!';
    my @hits = /([0-9]+(?:\.[0-9]+)?)/g;
Now @hits will be 12.24 and 35. We can also pick out the word
following each number, using the regular expression we presented
earlier:
my @hits2 = /([0-9]+(?:\.[0-9]+)?)\s+([a-z]+)/g;
Now @hits2 is "12.24", "dollars", "35", "fish", because on each match,
we contribute the two memories to the result.
So, this is just a start, but I've run out of space. Some other
things to look up in the perlre documentation (via perldoc perlre)
include other assertions (such as lookahead and lookbehind
assertions), scalar use of the match g modifier, creating regular
expressions from variables, using whitespace within the regular
expression to embed commentary, evaluating code during the match, and
so on. And you might check out the perlretut page while you're at it,
which covers a lot of the same ground as what you've just read, but in
a different way. Hope this helps! Until next time, enjoy!