Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 60 (Sep 2005)
[suggested title: ``How to match common things'']
Regular expressions can be handy when used correctly, to distinguish
things of interest from amongst the strings in which they are hiding,
and to reject strings that don't belong. These typical uses for text
manipulation and input validation result in a lot of common regular
expressions to solve these frequent tasks. However, I often see
mistakes in selecting and applying a regular expression, so let's take
a look at some of the more common mistakes. As I go through the
examples, I'll presume the string to be validated is in $_ just to
keep the examples simple, and I'll also use the slash delimiters
(except where otherwise noted) for the regular expressions.
For example, one frequent check is to determine if a string contains a
positive integer. If I wasn't thinking properly, I might start with
something like /[0-9]+/ to say ``one or more digits''. I can
simplify this to /\d+/, but that's still wrong, because the match
isn't anchored. This means that the regular expression will match
as long as the string contains the regular expression, including
things like "abc123de". Oops.
So, the next step is to add anchors. Locking the regular expression
down to both the beginning and ending of the string typically looks
like /^\d+$/. However, this is still wrong, even though I
frequently see this solution. The problem is that $ can match
either before or after a final newline in the string, so this regular
expression can match "123\n" as well as "123". Again, oops!
Luckily, modern Perl versions provide the \z anchor, which really
does mean ``end of string'' always. So, the proper answer is
/^\d+\z/. Or is it? Although deprecated, the $* variable
controls the matching of ^ and $ to permit internal newline
matches as well as end-of-string matches. If that variable is set,
the string "foobar\n123" will also match our new regular
expression. Oops again. So the proper answer is /\A\d+\z/, which
says ``beginning of string'' followed by ``one or more digits'' followed
by ``end of string''. Precisely, and accurately. Finally!
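The progression above can be checked with a short script. This is only a sketch: the hash keys are labels I made up for illustration, and because recent Perl releases no longer support $*, I use the /m modifier as a stand-in for its effect on ^.

```perl
use strict;
use warnings;

# The candidate patterns from the text, from sloppiest to strictest.
# The key names are illustrative labels, not anything standard.
my %pattern = (
    unanchored     => qr/\d+/,
    dollar         => qr/^\d+$/,
    z_only         => qr/^\d+\z/,
    multiline      => qr/^\d+\z/m,   # /m stands in for the old $* = 1
    fully_anchored => qr/\A\d+\z/,
);

my @samples = ("123", "abc123de", "123\n", "foobar\n123");

for my $name (qw(unanchored dollar z_only multiline fully_anchored)) {
    for my $sample (@samples) {
        # Make newlines visible in the report.
        (my $shown = $sample) =~ s/\n/\\n/g;
        printf "%-15s %-15s %s\n", $name, qq("$shown"),
            $sample =~ $pattern{$name} ? "match" : "no match";
    }
}
```

Only the last pattern accepts "123" while rejecting the other three samples.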
Now, at what point in this list of progressive regular expressions were you surprised? If not until the end, good for you. But hopefully you can see that regular expressions are a bit trickier than they seem.
As an alternative to all those special characters, I might just
consider using a negative match against /\D/: that is, the string
is fine as long as it doesn't contain a non-digit. But that's not
precisely the opposite. See if you can figure out the one string
that matches neither /\D/ nor /\A\d+\z/ before reading on.
That's right, the empty string! Again, you need to decide exactly what you want to match, and how you want to match it. Regular expressions are powerful, but as I recently heard in a movie, ``with great power comes great responsibility''.
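A quick way to see the difference is to run both tests side by side. In this sketch, lacks_nondigit and is_all_digits are names I invented for the two approaches.

```perl
use strict;
use warnings;

# Negative test: fine as long as no non-digit is present.
sub lacks_nondigit { my ($s) = @_; return $s !~ /\D/ ? 1 : 0 }

# Positive test: one or more digits, and nothing else.
sub is_all_digits  { my ($s) = @_; return $s =~ /\A\d+\z/ ? 1 : 0 }

for my $s ("", "123", "12a") {
    printf qq("%s": lacks_nondigit=%d is_all_digits=%d\n),
        $s, lacks_nondigit($s), is_all_digits($s);
}
```

The two tests agree on "123" and "12a", and disagree only on the empty string.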
I don't think more than a few weeks go by before I see someone attempt to match or validate ``an email address'' by using an incorrect regular expression. Most people who are trying to validate an email address apparently have never heard of the RFCs, such as RFC 822, which has defined the standard Internet email address since 1982 (superseding RFC 733 before that).
Because they base the email address on only what they've seen, they
write broken regular expressions such as: /^\w+\@[\w.]+$/. The
attempt here is to match word characters (alphanumerics and
underscore) for both the local part (what we often call the user
name), and the hostnames (to the right of the at-sign). Just
starting with the hostname mistake, this excludes - (which is a
valid hostname character), and includes the underscore (which
isn't). Oops.
But even if you got that part right, through careful examination of hostnames, paying attention to the 2 character top-level domains for countries, and the 3, 4, and now more character top-level abstract domains, the big failure here is the left side.
RFC 822 is very liberal about what is accepted for the local part.
Basically, to the left of the at-sign, we see in the RFC that the
definition of ``local-part'' is one or more period-connected ``words'',
and that a word is either an atom or a quoted string, and that
an atom is everything that doesn't contain whitespace or one of
the special characters (matching /[()<>\@,;:\\".\[\]]/).
Wait? Does this mean that
Randal.L.Schwartz@stonehenge.comm
is a valid email address? Yes! And that's already not matched by our previous regex. But even more, it means that my friend Eli-The-Bearded, who uses
*@qz.to
as his email address is also using a valid email address!
Now, if you showed the first address to someone who wrote that first
regular expression, they might quickly ``patch up'' that pattern to
match periods as well as \w+. But that wouldn't be sufficient to
match Eli's address. And it wouldn't work on addresses like:
gateway."[foo]@bar//35"@relay.machine.oldcompany.com
where the local part contains quoted parts that need to be properly passed over when looking for at-signs and so on. To do that, you'd need to create something that mimics the RFC definition (local part is a series of period-connected words, and each word is either a non-special string, or a quoted string).
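To make the failure concrete, here is a small sketch that runs a naive pattern and its ``patched'' cousin against some addresses. Both patterns are my own guesses at what such code typically looks like, and fred.flintstone@example.com is a made-up address; the other two come from the text above.

```perl
use strict;
use warnings;

# Assumed examples of the kind of pattern people write; not from any library.
my $naive   = qr/\A\w+\@[\w.]+\z/;       # word characters only on the left
my $patched = qr/\A[\w.]+\@[\w.]+\z/;    # "fixed" to allow periods as well

my @addresses = (
    'fred.flintstone@example.com',       # hypothetical dotted local part
    '*@qz.to',
    'gateway."[foo]@bar//35"@relay.machine.oldcompany.com',
);

for my $address (@addresses) {
    printf "%-55s naive=%-3s patched=%s\n", $address,
        ($address =~ $naive   ? "yes" : "no"),
        ($address =~ $patched ? "yes" : "no");
}
```

The patch rescues only the dotted address; the other two addresses are still rejected.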
But that still wouldn't solve the last problem. RFC 822 permits comments in the email address, enclosed in balanced parentheses. The example given in the RFC is:
Muhammed.(I am the greatest) Ali @(the)Vegas.WBA
That is, the address is actually Muhammed.Ali@Vegas.WBA, but the
parenthesized parts are legally part of the email address, although
ignored.
Well, that still doesn't look hard, because comments are permitted only between tokens. But the biting part of the specification is the word ``balanced''. If the parentheses can be nested, there's no way to get a normal regular expression to match it! (Yes, recent Perl versions have some extra tricks to get Perl code to execute during the matching of a regular expression, which would help us solve this, but let's rule that out for now.)
Even if we pre-process these comments and replace them with a single
whitespace, the resulting regular expression (shown at
http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html, for
example) is over 6000 characters long. Not something you're going to
cut and paste into each program, but thankfully we don't need to do
that.
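The pre-processing step itself is not hard to sketch, because a loop can peel off the innermost comment on each pass, which a single ordinary regular expression can't do for arbitrary nesting. The strip_comments helper below is my own illustration, and it makes simplifying assumptions: it ignores the possibility of parentheses inside quoted strings, and the final whitespace squeeze would damage a quoted local part that contains spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each parenthesized comment with one space.
# Each pass removes a comment that has no comment nested inside it, so
# nested comments disappear from the inside out.
sub strip_comments {
    my ($address) = @_;
    1 while $address =~ s/\((?:[^()\\]|\\.)*\)/ /s;
    return $address;
}

my $raw = 'Muhammed.(I am the greatest) Ali @(the)Vegas.WBA';

# Squeeze out the leftover whitespace (a simplification, see above).
(my $bare = strip_comments($raw)) =~ s/\s+//g;

print "$bare\n";    # prints Muhammed.Ali@Vegas.WBA
```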
Instead, we can simply pull in Email::Address or Email::Valid
from the CPAN. These modules encapsulate the rules for an RFC 822
valid email address. That's the way to get it right.
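As a rough sketch of what that looks like with Email::Valid, the code below calls its address method, which returns the address when it is acceptable and undef otherwise. The checked_address wrapper is my own invention, and I load the module at run time only so that the sketch still runs (rejecting everything) on a machine where Email::Valid isn't installed.

```perl
use strict;
use warnings;

# Load at run time so a missing module isn't fatal to this sketch.
my $have_email_valid = eval { require Email::Valid; 1 };

# Hypothetical wrapper: the accepted address, or undef if it is rejected
# (or if Email::Valid is not available at all).
sub checked_address {
    my ($candidate) = @_;
    return undef unless $have_email_valid;
    return scalar Email::Valid->address($candidate);
}

for my $candidate ('*@qz.to', 'no at-sign here') {
    my $result = checked_address($candidate);
    printf "%-18s %s\n", $candidate,
        defined $result ? "accepted as $result" : "rejected (or module missing)";
}
```

In a real program, a plain "use Email::Valid;" at the top is the usual way to load it.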
Other useful regular expressions have been rolled into one module,
Regexp::Common. For example, to match all HTTP URIs in a string,
we can say:
use Regexp::Common qw(URI);
while (/($RE{URI}{HTTP})/g) { print "the string contains the URI $1\n"; }
Again, we don't have to spend time staring at specifications: someone has done the work for us.
So, hopefully I've scared you enough into not inventing regular expressions on your own without looking around a bit for someone else who has gone ahead of you on the problem. Look at the CPAN first, learn to read every part of a regular expression, and ask around to see if your solution makes sense. Until next time, enjoy!