Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 19 (March 1998)
Perl excels at text wrangling. There's no doubt about that. Perl is great for the one-off tasks that allow us to get our job done faster, easier, and more reliably. But sometimes we're called upon to write tools that have more flexibility, and thus take more than a few lines to do, or get used for more than one application.
In this column, I take a look at a typical text processing application with a fairly common twist -- processing a file line by line, but allowing for ``include files''. These include files will contain additional information to be processed, but only the filename will be indicated in the original file.
So, we have a file that we want to process line by line. This is fairly uninteresting, but it would look something like this:
while (<>) { &process($_); }
where process is defined somewhere else. Yes, the basic ``read a line at a time into $_'' loop. If you're trying out the code snippets, you'll probably need a definition for this subroutine, so let's just use:
sub process { my $line = shift; print "processed: $line"; }
Now, let's say that some of the lines in the text file are actually ``include'' lines. If any line is of the form:
#include fred
then the contents of file fred are automatically processed as if they were part of the original file at this location. For this, we'll have to recognize the line and extract the filename -- not hard at all with regular expressions:
if (/^#include (\S+)/) { $name = $1; ...; }
And now we have to decide what goes into the ``...'' here. Well, we need to open a file, read its contents line by line, and then close the file. Not hard -- we already have the name:
if (/^#include (\S+)/) {
  $name = $1;
  open F, $name or die "Cannot open $name: $!";
  while (<F>) { &process($_); }
  close F;
} else { # wasn't an include
  &process($_);
}
And this all goes inside the outer while loop presented initially. Great, now we can handle include files.

Well, at least one level deep. But what if the included file itself wants to include another file? That won't work here. There's nothing in the inner ``read F'' loop that looks for an include. But both of the loops call &process, and that suggests an approach. Let's invert the logic a bit:
while (<>) { &process_or_include($_); }
sub process_or_include {
  local $_ = shift;
  if (/^#include (\S+)/) {
    &include($1);
  } else {
    &process($_);
  }
}
Here, we've changed the top-most loop to call &process_or_include with each line. The subroutine takes the line and shifts it into a new local $_ (to make it easy to use regular expressions on it).

Next, if the line begins with ``#include'' and contains a non-blank filename, we call &include (defined below); if not, we call the original &process with the ordinary line.
Now we just have to figure out how to define &include, which is passed a filename, and will presumably call &process_or_include on each line. This shouldn't be hard... just write it like before:
sub include {
  my $name = shift;
  open F, $name or die "Cannot open $name: $!";
  while (<F>) { &process_or_include($_); }
  close F;
}
As before, the incoming filename is shifted into $name, which is then used to open the F filehandle (causing death if not available for reading). The filehandle is then read one line at a time into $_, and we call &process_or_include (defined above) for each line. When we're done with the contents of the file, the filehandle is closed.
Yes. Right. Uh, no, not right. Why not? At first glance, it looks correct: for each line of the file opened on filehandle F, process it or treat it as an include. But that's the problem. There's only one filehandle name, and it's not local to the subroutine, so each recursive invocation of &include shares the same F filehandle.
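To watch the failure happen, here's a small self-contained sketch. The file names, the nested-include setup, and the stub &process that just collects lines are all invented for the demonstration; the include logic is the buggy bareword-F version from above. (I've left off use warnings on purpose, since the whole point is that the outer loop silently reads from a closed handle.)

```perl
use strict;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# inner file: one ordinary line
open my $out, '>', "$dir/inner" or die $!;
print $out "inner line\n";
close $out;

# outer file: a line, an include, and a line that will be lost
open $out, '>', "$dir/outer" or die $!;
print $out "first line\n#include $dir/inner\nlast line\n";
close $out;

my @seen;
sub process { push @seen, $_[0] }

sub include {    # buggy: every invocation shares the one bareword filehandle F
  my $name = shift;
  open F, $name or die "Cannot open $name: $!";
  while (<F>) {
    if (/^#include (\S+)/) { include($1) } else { process($_) }
  }
  close F;
}

include("$dir/outer");
print scalar(@seen), " lines processed\n";   # 2, not 3 -- "last line" is lost
```

The recursive call reopens F on the inner file and then closes it, so when control returns to the outer loop, its next read sees a closed filehandle and the loop quietly ends, dropping everything after the #include line.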
What we need is a local filehandle. In older versions of Perl, this was rather messy, but modern versions handle it quite nicely. First, we add
use IO::File;
to the beginning of the script. This ``extends'' Perl to include the knowledge about the IO::File object. Don't worry: you won't need to know object-oriented programming to take advantage of this -- just a few simple syntax constructs.
Instead of opening an explicit filehandle, we now need to open a new local IO::File object in the subroutine:
sub include {
  my $name = shift;
  my $F = IO::File->new($name) or die "Cannot open $name: $!";
  while (<$F>) { &process_or_include($_); }
}
Notice that instead of calling open on the filehandle F, we're now calling the method new to return an IO::File object into the local scalar $F. We can then use this scalar anywhere we had previously used the filehandle, and it works just fine. Better still, recursive invocations of this subroutine will create brand new IO::File objects!
The bottom part of this subroutine reads a line at a time, but now from the object held in the $F variable. Each line still gets read into $_, and the loop ends when the read returns undef at end of file.
Another interesting side effect is that we no longer need to close the filehandle. When the variable $F goes out of scope (at the end of this subroutine), the corresponding ``filehandle'' is automatically closed. Very cool.
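The automatic close is easy to observe. In this sketch (the file name is invented), a buffered write becomes visible to a later reader only because the IO::File object was flushed and closed when it went out of scope -- no explicit close anywhere:

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

{
  my $F = IO::File->new("$dir/demo", 'w') or die "Cannot open: $!";
  print $F "hello from a lexical filehandle\n";
}   # $F goes out of scope here: the handle is flushed and closed for us

my $in = IO::File->new("$dir/demo") or die "Cannot reopen: $!";
print scalar <$in>;   # the buffered line made it to disk
```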
OK, so now we have a nice include processor that handles includes of includes. How else can we make it more useful? How about a ``search path'', giving many places where the file can be found if it's a relative file?
Not much tougher. We just need to change how the file gets opened, so let's factor that step out of the include subroutine:
sub include {
  my $name = shift;
  my $F = &find_file($name);
  while (<$F>) { &process_or_include($_); }
}
The filename to be included is shifted into $name, as before. Then, instead of opening the file directly, we call &find_file (defined below). The return value is an IO::File object, which gets used in the same way as before, including its eventual recycling at the end of this subroutine.

And &find_file is defined as follows:
sub find_file {
  local $_ = shift;
  if (/^\//) { # absolute
    return &must_open($_);
  }
  # relative
  for $dir (@path) {
    my $full = "$dir/$_";
    if (-e $full) {
      return &must_open($full);
    }
  }
  die "Cannot find $_ in @path";
}
The incoming parameter is a filename to be searched for along the search list; it's stored into the $_ variable. If $_ begins with a forward slash, it's an absolute path and must be used as-is. So, we call &must_open (defined below) with that full name, and return whatever it returns, which should be an IO::File object.
If the name doesn't begin with a forward slash, it's time to try it in each of the directories defined in the search path. The variable $dir is set to each element of @path, one at a time. A full pathname is constructed into $full (a temporary scalar variable). If this file exists (tested with -e), we try to open it by calling the same &must_open routine. If not, we go on to the next directory in the series.
If the @path elements are completely scanned and no suitable file has been found, then this subroutine triggers a die with an appropriate error message.

The function &must_open is defined as follows:
sub must_open {
  my $name = shift;
  IO::File->new($name) or die "Cannot open $name: $!";
}
Here, the passed-in filename $name is either opened, or we die. If the open succeeds, the new IO::File object becomes the return value. We can pass IO::File objects back and forth on the subroutine call stack -- that's certainly a lot more convenient than a bareword filehandle!
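Putting all the pieces together, the whole include processor fits on a page. Here's the assembled version as a sketch; the directories in @path are invented placeholders, and the stub &process just echoes each line, so adjust both to taste:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::File;

# assumed search path -- placeholder directories, adjust for your setup
my @path = ('.', '/usr/local/lib/include');

while (<>) { process_or_include($_) }

sub process_or_include {
  local $_ = shift;
  if (/^#include (\S+)/) { include($1) } else { process($_) }
}

sub include {
  my $name = shift;
  my $F = find_file($name);          # a fresh IO::File object per invocation
  while (<$F>) { process_or_include($_) }
}                                    # $F closes automatically here

sub find_file {
  local $_ = shift;
  return must_open($_) if /^\//;     # absolute path: use as-is
  for my $dir (@path) {              # relative: try each search directory
    my $full = "$dir/$_";
    return must_open($full) if -e $full;
  }
  die "Cannot find $_ in @path";
}

sub must_open {
  my $name = shift;
  IO::File->new($name) or die "Cannot open $name: $!";
}

sub process { my $line = shift; print "processed: $line" }
```

Because each call to include holds its own IO::File object, includes of includes nest to any depth without the handles stepping on each other.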
So, this example showed some basic text processing, local filehandles, and recursive subroutines. Not bad for a little hack. See you next time...