Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Unix Review Column 60 (Sep 2005)

[suggested title: ``How to match common things'']

Regular expressions can be handy when used correctly, to distinguish things of interest from amongst the strings in which they are hiding, and to reject strings that don't belong. These typical uses for text manipulation and input validation result in a lot of common regular expressions to solve these frequent tasks. However, I often see mistakes in selecting and applying a regular expression, so let's take a look at some of the more common mistakes. As I go through the examples, I'll presume the string to be validated is in $_ just to keep the examples simple, and I'll also use the slash delimiters (except where otherwise noted) for the regular expressions.

For example, one frequent check is to determine if a string contains a positive integer. If I wasn't thinking properly, I might start with something like /[0-9]+/ to say ``one or more digits''. I can simplify this to /\d+/, but that's still wrong, because the match isn't anchored. This means that the regular expression will match as long as the string contains the regular expression, including things like "abc123de". Oops.

So, the next step is to add anchors. Locking the regular expression down to both the beginning and ending of the string typically looks like /^\d+$/. However, this is still wrong, even though I frequently see this solution. The problem is that $ can match either before or after a final newline in the string, so this regular expression can match "123\n" as well as "123". Again, oops!

Luckily, modern Perl versions provide the \z anchor, which really does mean ``end of string'' always. So, the proper answer is /^\d+\z/. Or is it? Although deprecated, the $* variable controls the matching of ^ and $ to permit internal newline matches as well as end-of-string matches. If that variable is set, the string "foobar\n123" will also match our new regular expression. Oops again. So the proper answer is /\A\d+\z/, which says ``beginning of string'' followed by ``one or more digits'' followed by ``end of string''. Precisely, and accurately. Finally!
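
The progression above can be checked directly. This is only a small sketch: the hash name %pattern and its keys are made up for illustration, and the three test strings are the ones discussed in the text.

```perl
use strict;
use warnings;

# The three stages of the pattern, from loosest to strictest.
my %pattern = (
    unanchored => qr/\d+/,
    dollar     => qr/^\d+$/,
    strict     => qr/\A\d+\z/,
);

for my $string ("123", "abc123de", "123\n") {
    (my $shown = $string) =~ s/\n/\\n/g;    # make the newline visible
    for my $name (qw(unanchored dollar strict)) {
        my $verdict = $string =~ $pattern{$name} ? "matches" : "does not match";
        print qq{"$shown" $verdict the $name pattern\n};
    }
}
```

Only the strict pattern rejects both "abc123de" and "123\n" while still accepting "123".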

Now, at what point in this list of progressive regular expressions were you surprised? If not until the end, good for you. But hopefully you can see that regular expressions are a bit trickier than they seem.

As an alternative to all those special characters, I might just consider using a negative match against /\D/: that is, the string is fine as long as it doesn't contain a non-digit. But that's not precisely the opposite. See if you can figure out the one string that matches neither /\D/ nor /\A\d+\z/ before reading on.

That's right, the empty string! Again, you need to decide exactly what you want to match, and how you want to match it. Regular expressions are powerful, but as I recently heard in a movie, ``with great power comes great responsibility''.
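
To see the difference between the two tests side by side, here is a minimal sketch. The subroutine names are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Positive test: the whole string is one or more digits.
sub is_all_digits {
    my ($string) = @_;
    return $string =~ /\A\d+\z/ ? 1 : 0;
}

# Negative test: the string contains no non-digit anywhere.
sub has_no_nondigit {
    my ($string) = @_;
    return $string !~ /\D/ ? 1 : 0;
}

for my $string ("123", "12a", "") {
    printf qq{"%s": all digits = %d, no non-digit = %d\n},
        $string, is_all_digits($string), has_no_nondigit($string);
}
```

The two tests agree on "123" and "12a", and disagree only on the empty string, which contains no non-digit but also no digits.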

I don't think more than a few weeks go by before I see someone attempt to match or validate ``an email address'' by using an incorrect regular expression. Most people who are trying to validate an email address apparently have never heard of the RFCs, such as RFC 822 which has defined the standard Internet email address since 1982 (superseding RFC 733 before that).

Because they base the email address on only what they've seen, they write broken regular expressions such as: /^\w+\@[\w+.]$/. The attempt here is to match word characters (alphanumerics and underscore) for both the local part (what we often call the user name), and the hostnames (to the right of the at-sign). Just starting with the hostname mistake, this excludes - (which is a valid hostname character), and includes the underscore (which isn't). Oops.
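
The hostname half of that mistake is easy to demonstrate. In this sketch the two hostnames are made-up examples, and the second pattern checks only the character set (letters, digits, hyphen, period), not the full hostname rules.

```perl
use strict;
use warnings;

# What the broken pattern permits in a hostname: word characters and periods.
my $word_chars = qr/\A[\w.]+\z/;

# A closer character set: letters, digits, hyphens, and periods.
my $host_chars = qr/\A[A-Za-z0-9.-]+\z/;

for my $host ('my-host.example.com', 'my_host.example.com') {
    printf "%-22s word-char test: %-6s host-char test: %s\n",
        $host,
        ($host =~ $word_chars ? "passes" : "fails"),
        ($host =~ $host_chars ? "passes" : "fails");
}
```

The word-character test fails the hyphenated name and passes the underscored one, which is exactly backwards.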

But even if you got that part right, through careful examination of hostnames, paying attention to the 2 character top-level domains for countries, and the 3, 4, and now more character top-level abstract domains, the big failure here is the left side.

RFC 822 is very liberal about what is accepted for the local part. Basically, to the left of the at-sign, we see in the RFC that the definition of ``local-part'' is one or more period-connected ``words'', and that a word is either an atom or a quoted string, and that an atom is everything that doesn't contain whitespace or one of the special characters (matching /[()<>\@,;:\\"\.\[\]]/). Wait? Does this mean that

  Randal.L.Schwartz@stonehenge.com

is a valid email address? Yes! And that's already not matched by our previous regex. But even more, it means that my friend Eli-The-Bearded, who uses

  *@qz.to

as his email address is also using a valid email address!

Now, if you showed the first address to someone who wrote that first regular expression, they might quickly ``patch up'' that pattern to match periods as well as \w+. But that wouldn't be sufficient to match Eli's address. And it wouldn't work on addresses like:

  gateway."[foo]@bar//35"@relay.machine.oldcompany.com

where the local part contains quoted parts that need to be properly passed over when looking for at-signs and so on. To do that, you'd need to create something that mimics the RFC definition (local part is a series of period-connected words, and each word is either a non-special string, or a quoted string).
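
As a rough sketch of what ``mimicking the RFC definition'' might look like for the local part alone, the pattern can be built from named pieces. The variable and subroutine names here are my own, and the sketch deliberately ignores comments and the whitespace that the RFC permits between tokens, so it is not a complete validator.

```perl
use strict;
use warnings;

# An atom: one or more characters that are not whitespace, not control
# characters, and not one of the RFC 822 specials.
my $atom = qr/[^\s\x00-\x1f\x7f()<>\@,;:\\".\[\]]+/;

# A quoted string: double quotes around ordinary characters or
# backslash-escaped characters.
my $quoted = qr/"(?:[^"\\\r]|\\.)*"/s;

# A word is an atom or a quoted string; a local part is one or more
# words joined by periods.
my $word       = qr/(?:$atom|$quoted)/;
my $local_part = qr/$word(?:\.$word)*/;

sub looks_like_local_part {
    my ($text) = @_;
    return $text =~ /\A$local_part\z/ ? 1 : 0;
}

for my $candidate ('Randal.L.Schwartz', '*', 'gateway."[foo]@bar//35"', 'a..b') {
    printf "%-26s %s\n", $candidate,
        looks_like_local_part($candidate) ? "accepted" : "rejected";
}
```

The first three candidates are the local parts of the addresses shown earlier and are accepted; 'a..b' is rejected because there is no word between the two periods.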

But that still wouldn't solve the last problem. RFC 822 permits comments in the email address, enclosed in balanced parentheses. The example given in the RFC is:

  Muhammed.(I am  the greatest) Ali @(the)Vegas.WBA

That is, the address is actually Muhammed.Ali@Vegas.WBA, but the parenthesized parts are legally part of the email address, although ignored.

Well, that still doesn't look hard, because comments are permitted only between tokens. But the biting part of the specification is the word ``balanced''. If the parentheses can be nested, there's no way to get a normal regular expression to match it! (Yes, recent Perl versions have some extra tricks to get Perl code to execute during the matching of a regular expression, which would help us solve this, but let's rule that out for now.)

Even if we pre-process these comments and replace them with a single whitespace, the resulting regular expression (shown at http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html, for example) is over 6000 characters long. Not something you're going to cut and paste into each program, but thankfully we don't need to do that.
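
One way to do that pre-processing without any special regex features is to remove the innermost comments repeatedly until none remain. This is a simplified sketch: strip_comments is a hypothetical helper, and it does not skip parentheses that appear inside quoted strings or after a backslash, both of which the RFC allows.

```perl
use strict;
use warnings;

# Replace each parenthesized comment with a single space, working from
# the innermost comment outward so that nesting is handled by the loop
# rather than by the pattern.
sub strip_comments {
    my ($address) = @_;
    1 while $address =~ s/\([^()]*\)/ /;
    return $address;
}

my $stripped = strip_comments('Muhammed.(I am  the greatest) Ali @(the)Vegas.WBA');
(my $compact = $stripped) =~ s/\s+//g;    # drop the whitespace between tokens
print "$compact\n";                       # prints Muhammed.Ali@Vegas.WBA
```

Each pass of the loop matches only a comment that contains no further parentheses, so a nested comment disappears from the inside out.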

Instead, we can simply pull in Email::Address or Email::Valid from the CPAN. These modules encapsulate the rules for an RFC 822 valid email address. That's the way to get it right.
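
For instance, a check with Email::Valid might look something like the following. This is a sketch, not a recommendation of specific options: checked_address and $have_module are names invented here, the module is loaded defensively because it is not part of core Perl, and only the module's address method is used, with its default settings.

```perl
use strict;
use warnings;

# Email::Valid comes from the CPAN, so it may not be installed.
my $have_module = eval { require Email::Valid; 1 };

# Hypothetical helper: returns the address as Email::Valid reports it,
# or undef when the address is rejected or the module is unavailable.
sub checked_address {
    my ($candidate) = @_;
    return unless $have_module;
    return scalar Email::Valid->address($candidate);
}

for my $candidate ('merlyn@stonehenge.com', 'not an address') {
    my $result = checked_address($candidate);
    print "$candidate: ",
        (defined $result ? "accepted as $result" : "not accepted"), "\n";
}
```

The point is that the program contains no email pattern of its own: the decision is delegated entirely to the module.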

Other useful regular expressions have been rolled into one module, Regexp::Common. For example, to match all HTTP URIs in a string, we can say:

  use Regexp::Common qw(URI);

  while (/($RE{URI}{HTTP})/g) {
    print "the string contains the URI $1\n";
  }

Again, we don't have to spend time staring at specifications: someone has done the work for us.

So, hopefully I've scared you enough into not inventing regular expressions on your own without looking around a bit for someone else who has gone ahead of you on the problem. Look at the CPAN first, learn to read every part of a regular expression, and ask around to see if your solution makes sense. Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.

