Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Unix Review Column 28 (Oct 1999)

[suggested title: Compiling Regular Expressions]

Perl's regular expressions set Perl apart from most other scripting languages. Some features (like the positive and negative lookahead, and the optional lazy evaluation of multipliers) make matching troublesome strings trivial in Perl. This power has not gone unnoticed by the open software community -- a package called PCRE available on the net (use your favorite search engine) claims to implement ``a regular expression matching engine that's compatible with Perl 5.004's syntax''.

Just like the strings they are matching, regular expressions can come from many sources. The most common source is inside the program:

    @ARGV = qw(/usr/dict/words);
    while (<>) {
        print if /foo/ || /bar/;
    }

This little guy ran in about a quarter of a CPU second for me, and generated a nice list of words that contain foo and bar. Notice that I wrote /foo/ and /bar/ as separate regular expressions, instead of the seemingly identical /foo|bar/. Why did I do that? Experience. As reported by the following program:

    @ARGV = qw(/usr/dict/words);
    @words = <>;
    use Benchmark;
    timethese (10 => {
        'expression or' =>
            '@x = grep /foo/ || /bar/, @words',
        'regex or' =>
            '@x = grep /foo|bar/, @words',
    });

we get the following output from Benchmark:

    Benchmark: timing 10 iterations of expression or, regex or...
    expression or:  1 wallclock secs ( 0.97 usr +  0.00 sys =  0.97 CPU)
    regex or:  3 wallclock secs ( 2.87 usr +  0.00 sys =  2.87 CPU)

This shows for this example, using the regular expression | operator was more than twice as slow. There are certain optimizations that kick in when the text to be matched is a constant string that cannot be done when we have something more complex, or so says the online documentation. The exact list of optimized things varies from release to release of Perl, so Benchmark is your friend.

Often, the regular expression will come from a computed string, such as the value of a web form field or a command-line parameter or the result of a prompted query. The Perl syntax lets us interpolate a variable into a regular expression, permitting the regular expression to be created at runtime:

    @ARGV = qw(/usr/dict/words);
    @words = <>;
    $regex1 = "foo";
    @result = grep /$regex1/, @words;

In order to be useful, a regular expression first has to be ``compiled''. This is often an expensive step compared to the actual matching of the string to the regular expression. So, usually, this compilation happens while the Perl program itself is parsed and compiled, before the execution of the program. However, when part of a regular expression is a variable (such as the example above), Perl can't do that, and defers the regular expression compilation until execution time.

While this provides a powerful option (creating a regular expression at runtime, it comes with a performance penalty if used incorrectly). Let's test this out:

    @ARGV = qw(/usr/dict/words);
    @words = <>;
    push @words, @words;
    push @words, @words;
    use Benchmark;
    $foo = '[f][o][o]';
    $bar = '[b][a][r]';
    timethese (5 => {
        'constant' =>
            '@x = grep /[f][o][o]/ || /[b][a][r]/, @words',
        'one variable' =>
            '@x = grep /$foo/ || /[b][a][r]/, @words',
        'two variables' =>
            '@x = grep /$foo/ || /$bar/, @words',
    });

And here's the results:

    Benchmark: timing 5 iterations of constant, one variable, two variables...
    constant:  3 wallclock secs ( 2.86 usr +  0.00 sys =  2.86 CPU)
    one variable:  4 wallclock secs ( 3.49 usr +  0.00 sys =  3.49 CPU)
    two variables:  4 wallclock secs ( 4.11 usr +  0.00 sys =  4.11 CPU)

Notice that we're paying a penalty for the recompilation of the regular expression. (I made the regular expression a little more complicated and used a little more data to make it obvious.)

Why is this? Well, on each match between an element of @words and the regular expression defined with a variable, Perl has to recompile the regular expression over, and over, and over again.

One way to fix this is to use the ``once-only'' modifier on a regular expression. By appending a o suffix to the regular expression match operator, any deferred compilation will be executed only once per program invocation. Let's see if this helps:

    @ARGV = qw(/usr/dict/words);
    @words = <>;
    push @words, @words;
    push @words, @words;
    use Benchmark;
    $foo = '[f][o][o]';
    $bar = '[b][a][r]';
    timethese (5 => {
        'constant' =>
            '@x = grep /[f][o][o]/ || /[b][a][r]/, @words',
        'two variables' =>
            '@x = grep /$foo/ || /$bar/, @words',
        'two variables - opt' =>
            '@x = grep /$foo/o || /$bar/o, @words',
    });

And when we see the results, we confirm that it has helped significantly:

    Benchmark: timing 5 iterations of constant, two variables, two variables - opt...
    constant:  3 wallclock secs ( 2.86 usr +  0.01 sys =  2.87 CPU)
    two variables:  4 wallclock secs ( 4.15 usr +  0.01 sys =  4.16 CPU)
    two variables - opt:  3 wallclock secs ( 2.98 usr +  0.00 sys =  2.98 CPU)

Yes, now we're getting close to the original compiled-in regular expression speed. But there's a downside to using this once-only flag -- we can't change the regular expression after it has been used once.

So code like this is fine:

    $var = param('search_for');
    @results = grep /$var/o, @input;

But code like this is very broken, and hard to track down:

    @ARGV = qw(/usr/dict/words);
    @words = <>;
    for $item (qw(foo bar)) {
        @results = grep /$item/o, @words;
        print @results. " words match $item\n";
    }

which prints:

    43 words match foo
    43 words match bar

instead of the proper answer:

    43 words match foo
    131 words match bar

which I got after removing the o suffix. That's because the foo string got memorized into the regular expression compilation, even after $item changed for the second iteration.

So what are we to do? How do we get the speed of a once-compiled regular expression but still be able to loop through many search patterns?

One way is to use an anonymous subroutine that is compiled once for each change in the pattern variable. That'd look like this:

    @ARGV = qw(/usr/dict/words);
    @words = <>;
    for $item (qw(foo bar)) {
        $match = eval 'sub { $_[0] =~ /$item/o }';
        @results = grep $match->($_), @words;
        print @results. " words match $item\n";
    }

Which again prints the right values. What's happening here is a bit weird... we're still using $item as the pattern, but because each iteration of the loop re-compiles the anonymous subroutine (referenced by $match), the once-only flag effectively resets.

Of course, we're throwing away the result of the compilation on the next iteration of the loop, but at least we're not recompiling for each item in @words.

You can even make a subroutine that makes these subroutines:

    sub make_match {
        my $item = shift;
        eval 'sub { $_[0] =~ /$item/o }';
    }
    $match_foo = make_match "foo";
    $match_bar = make_match "bar";
    @foo_words = grep $match_foo->($_), @words;
    @bar_words = grep $match_bar->($_), @words;

Here, the reference to $item in the anonymous subroutine generates a ``closure'', which remembers the value of $item independently as long as the subroutine is alive.

And then there's Perl 5.005 to the rescue! The qr// quoting operator was introduced in this latest version of Perl. The purpose of the operator is to provide a way to compile regular expressions and pass the compiled values around in variables, rather than the original source strings.

Let's fix that search again:

    @ARGV = qw(/usr/dict/words);
    @words = <>;
    for $item (qw(foo bar)) {
        $compiled = qr/$item/;
        @results = grep /$compiled/, @words;
        print @results. " words match $item\n";
    }

Ahh yes, the right answer again. And Perl compiles the regular expression once, rather than each time the string is seen. We could even compile them all at once:

    @patterns = map { qr/$_/ } qw(foo bar);

And then use @patterns interpolated as we would have used the original strings.

I hope you've enjoyed this little compilation about compiling regular expressions. Until next time, enjoy!

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.

Worldwide training and consulting by Perl experts

Copyright Notice

Unix Review Column 28 (Oct 1999)