Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 28 (Oct 1999)
[suggested title: Compiling Regular Expressions]
Perl's regular expressions set Perl apart from most other scripting languages. Some features (like the positive and negative lookahead, and the optional lazy evaluation of multipliers) make matching troublesome strings trivial in Perl. This power has not gone unnoticed by the open software community -- a package called PCRE available on the net (use your favorite search engine) claims to implement ``a regular expression matching engine that's compatible with Perl 5.004's syntax''.
Just like the strings they are matching, regular expressions can come from many sources. The most common source is inside the program:
@ARGV = qw(/usr/dict/words); while (<>) { print if /foo/ || /bar/; }
This little guy ran in about a quarter of a CPU second for me, and
generated a nice list of words that contain foo
and bar
.
Notice that I wrote /foo/
and /bar/
as separate regular
expressions, instead of the seemingly identical /foo|bar/
. Why
did I do that? Experience. As reported by the following program:
@ARGV = qw(/usr/dict/words); @words = <>; use Benchmark; timethese (10 => { 'expression or' => '@x = grep /foo/ || /bar/, @words', 'regex or' => '@x = grep /foo|bar/, @words', });
we get the following output from Benchmark
:
Benchmark: timing 10 iterations of expression or, regex or... expression or: 1 wallclock secs ( 0.97 usr + 0.00 sys = 0.97 CPU) regex or: 3 wallclock secs ( 2.87 usr + 0.00 sys = 2.87 CPU)
This shows for this example, using the regular expression |
operator was more than twice as slow. There are certain optimizations
that kick in when the text to be matched is a constant string that
cannot be done when we have something more complex, or so says the
online documentation. The exact list of optimized things varies from
release to release of Perl, so Benchmark
is your friend.
Often, the regular expression will come from a computed string, such as the value of a web form field or a command-line parameter or the result of a prompted query. The Perl syntax lets us interpolate a variable into a regular expression, permitting the regular expression to be created at runtime:
@ARGV = qw(/usr/dict/words); @words = <>; $regex1 = "foo"; @result = grep /$regex1/, @words;
In order to be useful, a regular expression first has to be ``compiled''. This is often an expensive step compared to the actual matching of the string to the regular expression. So, usually, this compilation happens while the Perl program itself is parsed and compiled, before the execution of the program. However, when part of a regular expression is a variable (such as the example above), Perl can't do that, and defers the regular expression compilation until execution time.
While this provides a powerful option (creating a regular expression at runtime, it comes with a performance penalty if used incorrectly). Let's test this out:
@ARGV = qw(/usr/dict/words); @words = <>; push @words, @words; push @words, @words; use Benchmark; $foo = '[f][o][o]'; $bar = '[b][a][r]'; timethese (5 => { 'constant' => '@x = grep /[f][o][o]/ || /[b][a][r]/, @words', 'one variable' => '@x = grep /$foo/ || /[b][a][r]/, @words', 'two variables' => '@x = grep /$foo/ || /$bar/, @words', });
And here's the results:
Benchmark: timing 5 iterations of constant, one variable, two variables... constant: 3 wallclock secs ( 2.86 usr + 0.00 sys = 2.86 CPU) one variable: 4 wallclock secs ( 3.49 usr + 0.00 sys = 3.49 CPU) two variables: 4 wallclock secs ( 4.11 usr + 0.00 sys = 4.11 CPU)
Notice that we're paying a penalty for the recompilation of the regular expression. (I made the regular expression a little more complicated and used a little more data to make it obvious.)
Why is this? Well, on each match between an element of @words
and
the regular expression defined with a variable, Perl has to
recompile the regular expression over, and over, and over again.
One way to fix this is to use the ``once-only'' modifier on a regular
expression. By appending a o
suffix to the regular expression
match operator, any deferred compilation will be executed only
once per program invocation. Let's see if this helps:
@ARGV = qw(/usr/dict/words); @words = <>; push @words, @words; push @words, @words; use Benchmark; $foo = '[f][o][o]'; $bar = '[b][a][r]'; timethese (5 => { 'constant' => '@x = grep /[f][o][o]/ || /[b][a][r]/, @words', 'two variables' => '@x = grep /$foo/ || /$bar/, @words', 'two variables - opt' => '@x = grep /$foo/o || /$bar/o, @words', });
And when we see the results, we confirm that it has helped significantly:
Benchmark: timing 5 iterations of constant, two variables, two variables - opt... constant: 3 wallclock secs ( 2.86 usr + 0.01 sys = 2.87 CPU) two variables: 4 wallclock secs ( 4.15 usr + 0.01 sys = 4.16 CPU) two variables - opt: 3 wallclock secs ( 2.98 usr + 0.00 sys = 2.98 CPU)
Yes, now we're getting close to the original compiled-in regular expression speed. But there's a downside to using this once-only flag -- we can't change the regular expression after it has been used once.
So code like this is fine:
$var = param('search_for'); @results = grep /$var/o, @input;
But code like this is very broken, and hard to track down:
@ARGV = qw(/usr/dict/words); @words = <>; for $item (qw(foo bar)) { @results = grep /$item/o, @words; print @results. " words match $item\n"; }
which prints:
43 words match foo 43 words match bar
instead of the proper answer:
43 words match foo 131 words match bar
which I got after removing the o
suffix. That's because the foo
string got memorized into the regular expression compilation, even
after $item
changed for the second iteration.
So what are we to do? How do we get the speed of a once-compiled regular expression but still be able to loop through many search patterns?
One way is to use an anonymous subroutine that is compiled once for each change in the pattern variable. That'd look like this:
@ARGV = qw(/usr/dict/words); @words = <>; for $item (qw(foo bar)) { $match = eval 'sub { $_[0] =~ /$item/o }'; @results = grep $match->($_), @words; print @results. " words match $item\n"; }
Which again prints the right values. What's happening here is a bit
weird... we're still using $item
as the pattern, but because each
iteration of the loop re-compiles the anonymous subroutine (referenced
by $match
), the once-only flag effectively resets.
Of course, we're throwing away the result of the compilation on the next
iteration of the loop, but at least we're not recompiling for each
item in @words
.
You can even make a subroutine that makes these subroutines:
sub make_match { my $item = shift; eval 'sub { $_[0] =~ /$item/o }'; } $match_foo = make_match "foo"; $match_bar = make_match "bar"; @foo_words = grep $match_foo->($_), @words; @bar_words = grep $match_bar->($_), @words;
Here, the reference to $item
in the anonymous subroutine generates
a ``closure'', which remembers the value of $item
independently as
long as the subroutine is alive.
And then there's Perl 5.005 to the rescue! The qr//
quoting
operator was introduced in this latest version of Perl. The purpose
of the operator is to provide a way to compile regular expressions and
pass the compiled values around in variables, rather than the original
source strings.
Let's fix that search again:
@ARGV = qw(/usr/dict/words); @words = <>; for $item (qw(foo bar)) { $compiled = qr/$item/; @results = grep /$compiled/, @words; print @results. " words match $item\n"; }
Ahh yes, the right answer again. And Perl compiles the regular expression once, rather than each time the string is seen. We could even compile them all at once:
@patterns = map { qr/$_/ } qw(foo bar);
And then use @patterns
interpolated as we would have used the
original strings.
I hope you've enjoyed this little compilation about compiling regular expressions. Until next time, enjoy!