Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 27 (Aug 2001)

[suggested title: Understanding Regular Expressions]

One of the things that distinguishes Perl as a Powerful Practical tool in the Linux toolbox is its ability to wrangle text in interesting ways while making it seem effortless. And a majority of that ability can be attributed to Perl's very powerful regular expressions.

Regular expressions are nothing new. I was using Unix tools with regular expressions in 1977, and I suspect they go back even further than that. But Perl continues to push the edge of how regular expressions work; so much so that there's a separate ``Perl-compatible regular expressions'' library (PCRE) so that other tools can catch up to Perl!

But before we get to the advanced stuff, let's quickly review the basics. A regular expression defines a template to match against a string. We say the string either matches, or doesn't match, the given regular expression, based on whether the string has properties that the regular expression demands.

For example, the regular expression /a/ demands that there be the letter a somewhere in the string. If there's a choice, and it matters, it matches the leftmost ``a'' (this is called the ``leftmost rule''). (We typically write the regular expression enclosed in forward slashes, because that's the most common use in Perl, although there are others.)

Of the ``atoms'' in a regular expression (from which the regular expression is built), the most common are the ordinary characters (such as the letter a above), including any special characters preceded by a backslash. And nearly everything that works in a double-quoted string also works in a regular expression, so \n is a newline, and \001 is a control-A, and so on.

But atoms also include character classes, such as [aeiou], looking for any single vowel, and [^qwertyuiop], looking for any single character not found on the top alphabetic row of the keyboard, and [a-z] indicating any one of the lowercase letters. Some character classes have abbreviations, such as \s for whitespace (and its companion \S for non-whitespace). And the very common . means match any single character except a newline character.
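As a quick sanity check of the character classes just described, here's a minimal sketch. The sample words are made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample words, chosen only for illustration.
my @words = ("sky", "tree", "QWERTY", "a b");

# [aeiou] needs at least one lowercase vowel somewhere in the string.
my @with_vowel = grep { /[aeiou]/ } @words;

# \s needs at least one whitespace character somewhere in the string.
my @with_space = grep { /\s/ } @words;

# . matches any single character except newline, so a lone "\n" fails.
my $dot_matches_newline = ("\n" =~ /./) ? 1 : 0;

print "vowel: @with_vowel\n";                        # vowel: tree a b
print "space: @with_space\n";                        # space: a b
print "dot matches newline? $dot_matches_newline\n"; # dot matches newline? 0
```

Note that ``sky'' and ``QWERTY'' both fail the vowel test: the class lists only the five lowercase letters shown.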

We can follow any of these atoms by a repetition operator, such as * (zero or more) or + (one or more) or ? (zero or one). Given the opportunity, these repetition operators match as much of the string as possible while still letting the rest of the regular expression match. We'll see an example or two of that later.

There's also the generalized repetition range operator, {m,n}, which looks for m through n instances of the atom it follows, and the modified forms of {m,} for ``m or more'' and {n} for ``exactly n''.
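Here's a small sketch of the three range forms, using made-up strings of one to four a's. It borrows the ^ and $ anchors (covered a little further on) so that the whole string has to be accounted for by the repetition.

```perl
use strict;
use warnings;

# Hypothetical sample strings: runs of one to four a's.
my @runs = ("a", "aa", "aaa", "aaaa");

# The ^ and $ anchors force the repetition to cover the entire string.
my @two_to_three = grep { /^a{2,3}$/ } @runs;   # two through three a's
my @two_or_more  = grep { /^a{2,}$/  } @runs;   # two or more a's
my @exactly_two  = grep { /^a{2}$/   } @runs;   # exactly two a's

print "two to three: @two_to_three\n";   # two to three: aa aaa
print "two or more:  @two_or_more\n";    # two or more:  aa aaa aaaa
print "exactly two:  @exactly_two\n";    # exactly two:  aa
```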

Atoms, with or without a trailing repetition operator, are frequently connected into sequences. The regular expression /pqr/ means look for a p immediately followed by a q and then an r. So /d[aiou]g/ looks for dag, dig, dog or dug. And /fre*ed/ will match freeeed or freed or even fred, because the repetition operator will ``back off'' after first matching the final ``e'' so that the non-repeated ``e'' gets a chance to match. We'd typically write that last one as /fre+d/ though.
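To see that /fre*ed/ and /fre+d/ really do accept the same strings, here's a minimal sketch run against a handful of made-up candidates.

```perl
use strict;
use warnings;

# Hypothetical candidate strings.
my @candidates = ("fred", "freed", "freeeed", "frd", "dug", "deg");

# e* may take zero e's, so the literal "ed" that follows still gets its e.
my @star = grep { /fre*ed/ } @candidates;

# The more usual spelling: one or more e's, then d.
my @plus = grep { /fre+d/ } @candidates;

# A character class in the middle of a sequence.
my @dxg = grep { /d[aiou]g/ } @candidates;

print "star: @star\n";   # star: fred freed freeeed
print "plus: @plus\n";   # plus: fred freed freeeed
print "dxg:  @dxg\n";    # dxg:  dug
```

The string ``frd'' fails both patterns because at least one e is required either way, and ``deg'' fails the last one because e isn't in the class.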

A sequence can also contain ``assertions''. The two most common assertions are ``beginning of string'' as ^ or ``end of string'' as $. So /small$/ matches the word ``small'' only when it appears at the end of the string, failing to match smalltalk.

Sequences can be alternated with the | operator, as in /fred|barney/ matching either of those names, or the names ``manfred'' or ``redbarney'', since we can still match any substring and be valid.

And finally, a regular expression enclosed in parentheses (called a subexpression) can be considered an atom again, starting us over at the top. So to apply a repetition to an alternation of two sequences, we can say /(fred|barney)+/, which means at least one fred or barney, such as ``fred'' or ``barney'' or ``barneyfred'' or ``fredfredfredbarneyfred''.
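The difference between a bare alternation and a repeated, parenthesized one shows up nicely once anchors are added. This is only a sketch, with names picked for illustration.

```perl
use strict;
use warnings;

# Hypothetical names to test.
my @names = ("fred", "manfred", "redbarney", "wilma", "barneyfred");

# Unanchored: any string containing fred or barney as a substring matches.
my @either = grep { /fred|barney/ } @names;

# Anchored and repeated: the whole string must be built from fred/barney.
my @whole = grep { /^(fred|barney)+$/ } @names;

print "either: @either\n";   # either: fred manfred redbarney barneyfred
print "whole:  @whole\n";    # whole:  fred barneyfred
```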

Parentheses also serve to indicate memory. As the regular expression is being matched against the string, the contents of the string matching a subexpression (enclosed in parentheses) ends up in a memory. First subexpression yields the first memory, second subexpression yields the second memory, and so on.

We have two ways of accessing the memories. First, when the smoke clears, the contents of the first subexpression are in the read-only Perl variable called $1, with the second in $2 and so on. So after we match /abc(.*)def/ against ``abcGHIdef'', we'll have GHI in $1, until the next successful match.

The other way of accessing the memory (which is used less frequently, but can still come in handy) is the backreference, where \1 is an atom denoting the first memory already saved earlier with the same regular expression. So /(['"])(.*)\1/ matches a single- or double-quoted string, with $2 being the contents between the quotes, and yet the quote marks have to be the same type of quotes. Rare, but cool when you need it.
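Here's a minimal sketch of that quote-matching pattern in action. The input strings are invented; the third one deliberately mixes its quote marks.

```perl
use strict;
use warnings;

# Hypothetical inputs; the last one opens with " but closes with '.
my @inputs = (q{say "hello" now}, q{say 'hi' now}, q{say "oops' now});

my @contents;
for my $input (@inputs) {
    if ($input =~ /(['"])(.*)\1/) {
        push @contents, $2;      # text between the matching quotes
    } else {
        push @contents, undef;   # no properly paired quotes found
    }
}

for my $found (@contents) {
    print defined $found ? "quoted: $found\n" : "no match\n";
}
# quoted: hello
# quoted: hi
# no match
```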

Occasionally, we'll need a set of parentheses that does not trigger a memory. For that, we can use ?: just inside the open parenthesis. As an example, /([0-9]+(?:\.[0-9]+)?)\s+([a-z]+)/ means an integer or floating point value followed by some whitespace and a lowercase word. The number will be in $1 and the word will be in $2, even though we had to use a third set of parentheses to make the fractional portion of the number optional.
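To make the memory numbering concrete, here's a sketch comparing the pattern above with a variant that uses ordinary parentheses for the fraction. The sample text is made up.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = "3.14 radians";

# With (?:...), the fraction group takes no memory slot.
my ($number, $word);
if ($text =~ /([0-9]+(?:\.[0-9]+)?)\s+([a-z]+)/) {
    ($number, $word) = ($1, $2);
}

# With plain parentheses, the fraction becomes $2 and the word moves to $3.
my ($fraction, $shifted_word);
if ($text =~ /([0-9]+(\.[0-9]+)?)\s+([a-z]+)/) {
    ($fraction, $shifted_word) = ($2, $3);
}

print "number=$number word=$word\n";                     # number=3.14 word=radians
print "fraction=$fraction third memory=$shifted_word\n"; # fraction=.14 third memory=radians
```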

Those are the basics, and will get you through about 90% of what you need for regular expressions. Let's look at how regular expressions are used now.

Perl's regular expression operators include the match, the substitute, and split. (I say ``include'' because I can't think of any others, but I'm trying to be accurate.) A scalar match is the most common:

  print "What small integer? ";
  chomp($_ = <STDIN>);
  if (/^(\d+)$/) {
    print "Good, you said $1!\n";
  }

Note here that the match is against the contents of $_ by default, but we can refer to any other value with the =~ operator. This code does the same thing as the code above:

  print "What small integer? ";
  if (<STDIN> =~ /^(\d+)$/) {
    print "Good, you said $1!\n";
  }

Note that we didn't even need the chomp here: an often-misunderstood property of the $ assertion is that it matches either right at the end of the string, or just before a newline at the end of the string. As this can lead to security holes, I'm now starting to include \z instead of $ more often in my programs, which says ``I absolutely want the end of the string here for this to match''.
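The difference between $ and \z is easy to demonstrate. In this sketch, the input string stands in for a line read from STDIN with its newline still attached.

```perl
use strict;
use warnings;

# Stand-in for an unchomped line of input.
my $input = "42\n";

# $ is satisfied just before a trailing newline...
my $dollar_ok = ($input =~ /^\d+$/)  ? 1 : 0;

# ...but \z insists on the true end of the string.
my $z_ok      = ($input =~ /^\d+\z/) ? 1 : 0;

print "with \$:  $dollar_ok\n";   # with $:  1
print "with \\z: $z_ok\n";        # with \z: 0
```

So if the value is going to be used somewhere that a stray newline matters, either chomp first or anchor with \z.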

The substitute operator replaces a portion of a string (by default in $_, but you can change that using =~) on a successful match. The replacement is interpreted like a double-quoted string.

  $_ = "hello, world";
  s/hello/Hello/; # "Hello, world"
  s/(Hello, )/${1}Perl /; # "Hello, Perl world"
  s/(.*), (.*)/$2, $1!/; # "Perl world, Hello!"

Both the match and the substitute operator can use alternate delimiters (nearly any other punctuation character) if the forward slash is troublesome, in particular when the regular expression or the replacement contains forward slashes:

  my $filename = "/home/merlyn/.newsrc";
  my $basename = $filename;
  $basename =~ s!.*/!!; # $basename now ".newsrc"

Note the use of ! as the delimiter here. We're guaranteed to match down to the final slash, because the ``.*'' matches zero or more of (nearly) any character, taking the longest possible match that still lets the rest of the regular expression match.

We successfully extracted the basename from that particular filename, but we blew it in the general case. Why? Because a newline character is valid in a Unix pathname. We need to match any possible characters before the final slash. We can do that with a character range:

  $basename =~ s![\000-\377]*/!!;

or more simply by tagging an s modifier onto the substitute:

  $basename =~ s!.*/!!s;

The s modifier changes . so that it matches newlines as well, and we now get any possible character there. Another useful modifier is the case-insensitive modifier of i. For example, /[aeiou]/i finds any vowel in upper or lower case. Note that you can also write that as /a|e|i|o|u/i, but the character class version will be considerably faster.
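Here's a sketch of the newline problem and the s modifier fix, using a contrived pathname with an embedded newline (legal, if unfriendly).

```perl
use strict;
use warnings;

# Contrived pathname: the middle directory name contains a newline.
my $path = "/tmp/odd\ndir/file.txt";

# Without /s, .* cannot cross the newline, so only "/tmp/" is removed.
(my $without_s = $path) =~ s!.*/!!;

# With /s, .* crosses the newline and reaches the final slash.
(my $with_s = $path) =~ s!.*/!!s;

print "without s leaves the right basename? ",
      ($without_s eq "file.txt" ? "yes" : "no"), "\n";   # no
print "with s: $with_s\n";                               # with s: file.txt
```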

What if we had wanted to find the nearest slash instead of the furthest slash? The easiest way is to tell the * repetition operator to be ``lazy'' instead of ``greedy''. Placing a ? immediately after a repetition operator tells it to take as few matches of that atom as possible, instead of the greatest number. For example:

  my $filename = "/home/bob/summary";
  my $one = my $two = $filename;
  $one =~ s!/.*/!/etc/!; # "/etc/summary"
  $two =~ s!/.*?/!/etc/!; # "/etc/bob/summary"

For $one, we grabbed the first slash, as many characters as we could, and then the final (third) slash.

But for $two, we grabbed the first slash, as few characters as we could, and then the next immediate slash (the second slash). Note that this didn't find the ``shortest overall match'' as some people have claimed incorrectly (which would have been ``/bob/'' rather than ``/home/''). It still starts with the first slash. This is similar to how /([ab]+)/ will match the a's in ___aa___bbb___, rather than the (longer) sequence of b's. It's ``leftmost match first'' and then the repetitions individually have biases towards ``longer matches'' (the default) or ``shorter matches'' from that starting point.
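The ``leftmost first, then greedy or lazy'' behavior can be checked directly by capturing what was matched. A minimal sketch, reusing the strings from the discussion above:

```perl
use strict;
use warnings;

# Leftmost wins: the shorter run of a's is found before the longer run of b's.
my $text = "___aa___bbb___";
my ($run) = $text =~ /([ab]+)/;

# Greedy and lazy both start at the first slash; they differ only in where they stop.
my $filename = "/home/bob/summary";
my ($greedy) = $filename =~ m!(/.*/)!;
my ($lazy)   = $filename =~ m!(/.*?/)!;

print "run:    $run\n";      # run:    aa
print "greedy: $greedy\n";   # greedy: /home/bob/
print "lazy:   $lazy\n";     # lazy:   /home/
```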

The split operator uses its regular expression to define a ``delimiter'', which is then found (usually multiple times) in a string. Each match is discarded (leading one of my friends to call it the ``deliminator''), leaving us with the pieces of string left as the list return value. So, a typical /etc/passwd-style file is parsed with relative ease:

  my $line = "merlyn:x:904:100:Randal L. Schwartz:/home/merlyn:/bin/perl\n";
  chomp $line;
  my @values = split /:/, $line;

Now @values has seven elements, corresponding to the seven items between the delimiters. If two colons were in a row, we'd get an empty element in the list:

  my @values = split /:/, "merlyn2::905:100::/home/merlyn2:/bin/perl";

Here, the second and fifth elements of @values are empty. Had we instead used /:+/ for the delimiter expression, each pair of consecutive colons would have been considered one big fat delimiter, and we'd have gotten five return values instead of seven.

This is typically desired when we are using whitespace as the delimiter: we'll use /\s+/ for the expression, because generally a hunk of whitespace in a row is a big fat delimiter, not many small omitted items.
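Here's a sketch of the difference, splitting a made-up line on a single space versus a run of whitespace.

```perl
use strict;
use warnings;

# Hypothetical line: three spaces, a tab, then two spaces between the words.
my $line = "alpha   beta\tgamma  delta";

# One space per delimiter: extra spaces yield empty fields, and the tab is kept.
my @single = split / /, $line;

# Any run of whitespace counts as one delimiter.
my @fields = split /\s+/, $line;

print scalar(@single), " pieces with / /\n";    # 6 pieces with / /
print scalar(@fields), " pieces with /\\s+/\n"; # 4 pieces with /\s+/
print "fields: @fields\n";                      # fields: alpha beta gamma delta
```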

Sometimes, it's easier to specify what we keep instead of what we throw away. For example, suppose I want to keep any integer or floating point values in a line, discarding anything else that doesn't look like a number. For that, we can use a match with a g modifier (for global) in a list context, which contributes $1 to a list result for each match:

  $_ = '12.24 dollars for 35 fish?  Are you crazy?!';
  my @hits = /([0-9]+(?:\.[0-9]+)?)/g;

Now @hits will be 12.24 and 35. We can also pick out the word following each number, using the regular expression we presented earlier.

  my @hits2 = /([0-9]+(?:\.[0-9]+)?)\s+([a-z]+)/g;

Now @hits2 is "12.24", "dollars", "35", "fish", because on each match, we contribute the two memories to the result.

So, this is just a start, but I've run out of space. Some other things to look up in the perlre documentation (via perldoc perlre) include other assertions (such as lookahead and lookbehind assertions), scalar use of the match g modifier, creating regular expressions from variables, using whitespace within the regular expression to embed commentary, evaluating code during the match, and so on. And you might check out the perlretut page while you're at it, which covers a lot of the same ground as what you've just read, but in a different way. Hope this helps! Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
