Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Unix Review Column 19 (March 1998)
Perl excels at text wrangling. There's no doubt about that. Perl is great for the one-off tasks that allow us to get our job done faster, easier, and more reliably. But sometimes we're called upon to write tools that have more flexibility, and thus take more than a few lines to do, or get used for more than one application.
In this column, I take a look at a typical text processing application with a fairly common twist -- processing a file line by line, but allowing for ``include files''. These include files will contain additional information to be processed, but only the filename will be indicated in the original file.
So, we have a file that we want to process line by line. This is fairly uninteresting, but it would look something like this:
while (<>) { &process($_); }
where process is defined somewhere else. Yes, the basic ``read a line at a time into $_'' loop. If you're trying out the code snippets, you'll probably need a definition for this subroutine, so let's just use:
sub process { my $line = shift; print "processed: $line"; }
Now, let's say that some of the lines in the text file are actually ``include'' lines. If any line is of the form:
#include fred
then the contents of file fred are automatically processed as if they were part of the original file at this location. For this, we'll have to recognize the line and extract the filename -- not hard at all with regular expressions:
if (/^#include (\S+)/) { $name = $1; ...; }
And now we have to decide what goes into the ``...'' here. Well, we need to open a file, read its contents line by line, and then close the file. Not hard -- we already have the name:
if (/^#include (\S+)/) {
  $name = $1;
  open F, $name or die "Cannot open $name: $!";
  while (<F>) { &process($_); }
  close F;
} else { # wasn't an include
  &process($_);
}
And this all goes inside the outer while loop presented initially. Great, now we can handle include files.

Well, at least one level deep. But what if the included file itself wants to include another file? That won't work here. There's nothing in the inner ``read F'' loop that looks for an include. But both of the loops call &process, and that suggests an approach. Let's invert the logic a bit:
while (<>) { &process_or_include($_); }
sub process_or_include {
  local $_ = shift;
  if (/^#include (\S+)/) {
    &include($1);
  } else {
    &process($_);
  }
}
Here, we've changed the top-most loop to call &process_or_include with each line. The subroutine takes the line and shifts it into a new local $_ (to make it easy to use regular expressions on it).

Next, if the line begins with ``#include'' and contains a non-blank filename, we call &include (defined below); if not, we call the original &process with the ordinary line.
Now we just have to figure out how to define &include, which is passed a filename, and will presumably call &process_or_include on each line. This shouldn't be hard... just write it like before:
sub include {
  my $name = shift;
  open F, $name or die "Cannot open $name: $!";
  while (<F>) { &process_or_include($_); }
  close F;
}
As before, the incoming filename is shifted into $name, which is then used to open the F filehandle (causing death if not available for reading). The filehandle is then read one line at a time into $_, and we call &process_or_include (defined above) for each line. When we're done with the contents of the file, the filehandle is closed.
Yes. Right. Uh, no, not right. Why not? At first glance, it looks correct: for each line of the file opened on filehandle F, process it or treat it as an include. But that's the problem. There's only one filehandle name, and it's not local to the subroutine, so each recursive invocation of &include shares the same F filehandle.
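To watch the failure happen, here's a small self-contained sketch. The file names, the nested-include setup, and the stub &process that just collects lines are all invented for the demonstration; the include logic is the buggy bareword-F version from above. (I've left off use warnings on purpose, since the whole point is that the outer loop silently reads from a closed handle.)

```perl
use strict;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# inner file: one ordinary line
open my $out, '>', "$dir/inner" or die $!;
print $out "inner line\n";
close $out;

# outer file: a line, an include, and a line that will be lost
open $out, '>', "$dir/outer" or die $!;
print $out "first line\n#include $dir/inner\nlast line\n";
close $out;

my @seen;
sub process { push @seen, $_[0] }

sub include {    # buggy: every invocation shares the one bareword filehandle F
  my $name = shift;
  open F, $name or die "Cannot open $name: $!";
  while (<F>) {
    if (/^#include (\S+)/) { include($1) } else { process($_) }
  }
  close F;
}

include("$dir/outer");
print scalar(@seen), " lines processed\n";   # 2, not 3 -- "last line" is lost
```

The recursive call reopens F on the inner file and then closes it, so when control returns to the outer loop, its next read sees a closed filehandle and the loop quietly ends, dropping everything after the #include line.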
What we need is a local filehandle. In older versions of Perl, this was rather messy, but modern versions handle it quite nicely. First, we add
use IO::File;
to the beginning of the script. This ``extends'' Perl to include the knowledge about the IO::File object. Don't worry: you won't need to know object-oriented programming to take advantage of this -- just a few simple syntax constructs.
Instead of opening an explicit filehandle, we now need to open a new local IO::File object in the subroutine:
sub include {
  my $name = shift;
  my $F = IO::File->new($name) or die "Cannot open $name: $!";
  while (<$F>) { &process_or_include($_); }
}
Notice that instead of calling open on the filehandle F, we're now calling the method new to return an IO::File object into the local scalar $F. We can then use this scalar anywhere we had previously used the filehandle, and it works just fine. Better still, recursive invocations of this subroutine will create brand new IO::File objects!
The bottom part of this subroutine reads a line at a time, but now from the object held in the $F variable. Each line still gets read into $_, and the loop ends when the read returns undef at end of file.
Another interesting side effect is that we no longer need to close the filehandle. When the variable $F goes out of scope (at the end of this subroutine), the corresponding ``filehandle'' is automatically closed. Very cool.
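The automatic close is easy to observe. In this sketch (the file name is invented), a buffered write becomes visible to a later reader only because the IO::File object was flushed and closed when it went out of scope -- no explicit close anywhere:

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

{
  my $F = IO::File->new("$dir/demo", 'w') or die "Cannot open: $!";
  print $F "hello from a lexical filehandle\n";
}   # $F goes out of scope here: the handle is flushed and closed for us

my $in = IO::File->new("$dir/demo") or die "Cannot reopen: $!";
print scalar <$in>;   # the buffered line made it to disk
```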
OK, so now we have a nice include processor that handles includes of includes. How else can we make it more useful? How about a ``search path'', giving many places where the file can be found if it's a relative file?
Not much tougher. We just need to change how the file gets opened, so let's factor that step out of the include subroutine:
sub include {
  my $name = shift;
  my $F = &find_file($name);
  while (<$F>) { &process_or_include($_); }
}
The filename to be included is shifted into $name, as before. Then, instead of opening the file directly, we call &find_file (defined below). The return value is an IO::File object, which gets used in the same way as before, including its eventual recycling at the end of this subroutine.

And &find_file is defined as follows:
sub find_file {
  local $_ = shift;
  if (/^\//) { # absolute
    return &must_open($_);
  }
  # relative
  for $dir (@path) {
    my $full = "$dir/$_";
    if (-e $full) {
      return &must_open($full);
    }
  }
  die "Cannot find $_ in @path";
}
The incoming parameter is a filename to be searched for along the search list; it's stored into the $_ variable. If $_ begins with a forward slash, it's an absolute path and must be used as-is. So, we call &must_open (defined below) with that full name, and return whatever it returns, which should be an IO::File object.
If the name doesn't begin with a forward slash, it's time to try it in each of the directories defined in the search path. The variable $dir is set to each element of @path, one at a time. A full pathname is constructed into $full (a temporary scalar variable). If this file exists (tested with -e), we try to open it by calling the same &must_open routine. If not, we go on to the next directory in the series.
If the @path elements are completely scanned and no suitable file has been found, then this subroutine triggers a die with an appropriate error message.

The function &must_open is defined as follows:
sub must_open {
  my $name = shift;
  IO::File->new($name) or die "Cannot open $name: $!";
}
Here, the passed-in filename $name is either opened, or we die. If the open succeeds, the new IO::File object becomes the return value. We can pass IO::File objects back and forth on the subroutine call stack -- that's certainly a lot more convenient than a bareword filehandle!
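Putting all the pieces together, the whole include processor fits on a page. Here's the assembled version as a sketch; the directories in @path are invented placeholders, and the stub &process just echoes each line, so adjust both to taste:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::File;

# assumed search path -- placeholder directories, adjust for your setup
my @path = ('.', '/usr/local/lib/include');

while (<>) { process_or_include($_) }

sub process_or_include {
  local $_ = shift;
  if (/^#include (\S+)/) { include($1) } else { process($_) }
}

sub include {
  my $name = shift;
  my $F = find_file($name);          # a fresh IO::File object per invocation
  while (<$F>) { process_or_include($_) }
}                                    # $F closes automatically here

sub find_file {
  local $_ = shift;
  return must_open($_) if /^\//;     # absolute path: use as-is
  for my $dir (@path) {              # relative: try each search directory
    my $full = "$dir/$_";
    return must_open($full) if -e $full;
  }
  die "Cannot find $_ in @path";
}

sub must_open {
  my $name = shift;
  IO::File->new($name) or die "Cannot open $name: $!";
}

sub process { my $line = shift; print "processed: $line" }
```

Because each call to include holds its own IO::File object, includes of includes nest to any depth without the handles stepping on each other.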
So, this example showed some basic text processing, local filehandles, and recursive subroutines. Not bad for a little hack. See you next time...