Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Linux Magazine Column 10 (Mar 2000)

[suggested title: Filehandle references]

In the past two columns, I looked at using ``references'' in Perl, and showed the basic syntax for creating references to arrays, hashes, scalars, and subroutines. I also described the canonical form for converting a non-reference expression into a reference, and how to use the shortcut rules to make this simpler.

Let's take a look now at filehandles and directoryhandles. These handles let us look at or talk to the ``outside world'', useful if we want our program to have some permanent impact on the computing environment.

First, recall that a filehandle doesn't really have a syntax to make it a full-fledged variable. You can't assign it, use it (directly) with local() or my(), pass it to or from a subroutine, or store it into a data structure. What does that leave? Well, there are about a dozen operations that use a filehandle or a directoryhandle, specified by a bareword (a sequence of alphanumeric symbols separated by double colons, like STDIN or MyPack::Output). As a refresher, that looks like this:

  while (<INPUT>) {
    last unless /\S/;
    print OUTPUT $_;
  }

Now, that's a nice ordinary chunk of code. In fact, it's a nicely useful chunk of code which copies all of the contents on the filehandle INPUT to the filehandle OUTPUT, until the first blank line (great for processing news or mail messages or HTTP responses).
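To try that loop without any real files, here is a minimal sketch that opens INPUT and OUTPUT onto strings instead. The in-memory open is an assumption of this sketch (it needs Perl 5.8 or later), and the sample message is made up for illustration.

```perl
use strict;
use warnings;

# A made-up message: two header lines, a blank line, then the body.
my $message = "From: merlyn\nSubject: hello\n\nThis is the body.\n";
my $copied  = '';

# In-memory filehandles (Perl 5.8 and later) stand in for real files.
open INPUT,  '<', \$message or die "cannot read message: $!";
open OUTPUT, '>', \$copied  or die "cannot write copy: $!";

while (<INPUT>) {
  last unless /\S/;    # stop at the first line with no non-whitespace
  print OUTPUT $_;
}
close OUTPUT;

# The blank line has been consumed, so what remains is the body.
my $rest = join '', <INPUT>;
close INPUT;

print "copied header:\n$copied";
print "left unread:\n$rest";
```

Only the two header lines land in $copied; the body is still waiting on INPUT when the loop exits.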

We could drop this code into a subroutine so that it might be used from various places around the program:

  sub copyheader {
    while (<INPUT>) {
      last unless /\S/;
      print OUTPUT $_;
    }
  }

But to use this subroutine, I have to specifically use the filehandles called INPUT and OUTPUT. What if I wanted to, say, copy MAILMSG to STDOUT? I can't assign INPUT from MAILMSG. But I can get almost the same thing with a glob assignment:

  *INPUT = *MAILMSG;
  *OUTPUT = *STDOUT;
  copyheader();

Now, a bit of explanation is needed here. The prefix asterisk operator in *INPUT means ``access (or alter) a magic value that denotes the symbol table entry for everything named INPUT''. You don't have to know too much about how Perl stores things, but suffice it to say that when we execute *INPUT = *MAILMSG, any reference to anything named INPUT is automatically redirected to the current corresponding item named MAILMSG. This is true for $INPUT, @INPUT, %INPUT, and &INPUT. Those, we don't care about, but also any use of the INPUT filehandle is automatically mapped to the MAILMSG filehandle instead! Now, that's useful.
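Here is a small sketch of that aliasing in action. The variable contents and the in-memory file (Perl 5.8 or later) are made up for illustration; the point is only that the scalar, the array, and the filehandle named INPUT all follow MAILMSG after the glob assignment.

```perl
use strict;
use warnings;

our $MAILMSG = 'the MAILMSG scalar';
our @MAILMSG = ('the', 'MAILMSG', 'array');
our ($INPUT, @INPUT);    # declared so "use strict" lets us name them

# An in-memory file standing in for a real mail message.
my $text = "first line of mail\n";
open MAILMSG, '<', \$text or die "cannot open in-memory file: $!";

*INPUT = *MAILMSG;       # alias everything named INPUT to MAILMSG

my $line = <INPUT>;      # actually reads from the MAILMSG filehandle
push @INPUT, 'grew';     # actually pushes onto @MAILMSG

print "$INPUT\n";               # the MAILMSG scalar
print scalar(@MAILMSG), "\n";   # 4
print $line;                    # first line of mail
```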

So, while the subroutine thinks it is copying from INPUT to OUTPUT, we're actually in effect copying from MAILMSG to STDOUT. The downside is that we can no longer access anything originally named INPUT, so we should choose the name wisely. And, the change is also permanent. Or is it? Not if we use local() in the right way:

  {
    local *INPUT = *MAILMSG;
    local *OUTPUT = *STDOUT;
    copyheader();
  }

Here, the assignment that aliases ``all things INPUT'' to ``all things MAILMSG'' is done as a local operation, meaning it will be undone at the end of the enclosing block. That's good news, because outside this block, everything is as it once was (except the MAILMSG and STDOUT filehandles are now in new positions within their respective files).
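A quick way to watch the alias come and go is to look at the array slot rather than the filehandle. This is only a sketch; first_of_input() is a hypothetical helper invented for the demonstration, and it always refers to @INPUT by name.

```perl
use strict;
use warnings;

our @INPUT   = ('original', 'INPUT');
our @MAILMSG = ('from', 'MAILMSG');

# Hypothetical helper: always looks at whatever @INPUT currently is.
sub first_of_input { return $INPUT[0] }

my @seen;
push @seen, first_of_input();     # before the block
{
  local *INPUT = *MAILMSG;        # temporary alias
  push @seen, first_of_input();   # now sees @MAILMSG
}
push @seen, first_of_input();     # alias undone at block exit

print "@seen\n";                  # original from original
```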

We've still hit a bit too much, though. That assignment aliased everything named INPUT to its MAILMSG counterpart, so inside the block @INPUT really means @MAILMSG, when all we needed was to alias the filehandles across. That's just a bit trickier, with:

  {
    local *INPUT = *MAILMSG{IO};
    local *OUTPUT = *STDOUT{IO};
    copyheader();
  }

The {IO} suffix indicates that we don't want the symbol access for everything named MAILMSG, but just the filehandle and directory handle named MAILMSG. The alias assignment therefore carries across just those, and not the scalar, array, hash, or subroutine symbols as well. (The local *INPUT itself still sets aside every original INPUT entry until the block exits, so inside the block @INPUT is a fresh, empty array rather than the original one; it just isn't @MAILMSG any more.) To select the other entries individually, we can use *FOO{SCALAR}, *FOO{ARRAY}, *FOO{HASH} or *FOO{CODE}, respectively. It's also important to note that if any of these have not yet been used in a program, the value will be undef, and won't make any sense to alias to another symbol entry.
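The following sketch checks which entries the {IO} form touches. The data and the in-memory file (Perl 5.8 or later) are assumptions made for the demonstration. Note that local *INPUT on its own sets aside the whole INPUT entry for the duration of the block, so @INPUT shows up there as a fresh empty array; the {IO} assignment then fills in only the filehandle.

```perl
use strict;
use warnings;

our @INPUT   = ('original', 'INPUT');
our @MAILMSG = ('from', 'MAILMSG');

my $text = "a line of mail\n";
open MAILMSG, '<', \$text or die "cannot open in-memory file: $!";

my ($line, $count, $same_array);
{
  local *INPUT = *MAILMSG{IO};              # install only the filehandle
  $line       = <INPUT>;                    # reads through MAILMSG's handle
  $count      = @INPUT;                     # 0: a fresh, empty array
  $same_array = (\@INPUT == \@MAILMSG);     # false: the array was not aliased
}

print $line;
print "inside: $count, outside: ", scalar(@INPUT), "\n";
print "array aliased? ", ($same_array ? "yes" : "no"), "\n";
```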

We can also use this syntax to pass these entries into a subroutine:

  copyheader(*MAILMSG{IO}, *STDOUT{IO});
  sub copyheader {
    local *INPUT = shift;
    local *OUTPUT = shift;
    while (<INPUT>) {
      last unless /\S/;
      print OUTPUT $_;
    }
  }

Hey, nearly like assignable filehandles now, albeit with an ugly syntax. We can simplify this even further with another trick. Nearly any place you have a filehandle, you can also stick a simple scalar variable. You can get one of these simple scalar variables as a normal lexical variable, so we can use these to grab the subroutine's input parameters like so:

  sub copyheader {
    my $in = shift;
    my $out = shift;
    while (<$in>) {
      last unless /\S/;
      print $out $_;
    }
  }

Wow. Looking much cleaner. Notice the use of the filehandle read operator (angle brackets) around the outside of the scalar variable. Some of the documentation refers to this as indirect filehandles, but that's just fancy talk.
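For completeness, here is one way the scalar-variable version might be called, sketched with in-memory files (an assumption requiring Perl 5.8 or later) and a made-up message. The handle name HEADER is invented for the example; the {IO} values are passed exactly as in the earlier call.

```perl
use strict;
use warnings;

sub copyheader {
  my $in  = shift;
  my $out = shift;
  while (<$in>) {
    last unless /\S/;
    print $out $_;
  }
}

# Made-up message and an empty string to collect the header.
my $message = "To: you\nSubject: test\n\nbody text\n";
my $header  = '';
open MAILMSG, '<', \$message or die "cannot read message: $!";
open HEADER,  '>', \$header  or die "cannot write header: $!";

copyheader(*MAILMSG{IO}, *HEADER{IO});
close HEADER;

print $header;
```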

Now, these subroutines have been using existing filehandles passed in by the caller. Can we likewise return a filehandle? Sure, by using the same syntax, roughly speaking.

  sub get_body_handle {
    my $filename = shift;
    local *IN;
    open IN, $filename or die "$filename: $!";
    while (<IN>) {
      last unless /\S/;
    }
    return *IN{IO};
  }
  {
    my $handle = get_body_handle("/home/merlyn/Mail/inbox/101");
    print "body: $_" while <$handle>;
  }

Here, we're creating a local symbol table entry for IN, which won't mess up any global use of the same name. Then, a normal open() connects up the filehandle, and we read forward until we've found the blank line. The return passes back the filehandle portion of the symbol table entry, and that's captured in $handle. And nicely enough, when the $handle variable goes out of scope, the filehandle is automatically closed, freeing up resources.

But that local *IN still bugs me. If the subroutine had needed to access @IN at the same time, we'd be in trouble. Worse yet, all the normal problems with local come into play... if this subroutine calls another subroutine, all things named IN are still obscured, probably very confusing to that other subroutine.

So, let's be slightly trickier, and we'll get all the same goodies without any of the downsides:

  sub get_body_handle {
    my $filename = shift;
    my $in = do { local *IN };
    open $in, $filename or die "$filename: $!";
    while (<$in>) {
      last unless /\S/;
    }
    return $in;
  }
  {
    my $handle = get_body_handle("/home/merlyn/Mail/inbox/101");
    print "body: $_" while <$handle>;
  }

Ahh, so the first thing you might notice is that I've gone back to indirect filehandle notation, using a simple scalar variable. But this variable is being initialized using a do-block. Inside this do-block we create a temporary symbol table entry, then return it. This is a very quick operation, very unlikely to mess up anyone (except signal handlers executed in that small window, but that would have its own troubles).

The symbol table name (here IN) is arbitrary. Also, if the filehandle container variable $in had gone out of scope before returning a value, the filehandle would have been closed automatically.

Well, we now have passing filehandles into subroutines, returning them from subroutines, and even creating local filehandles. We can also store these filehandles into an arbitrary data structure, and create directory handles the same way. For example, let's list a directory, returning the names of the ten most recently modified files using a localized directory handle:

  sub get_ten_newest_files {
    my $dirname = shift;
    my $handle = do { local *X };
    opendir $handle, $dirname or die "$dirname: $!";
    my @names = map "$dirname/$_", readdir $handle;
    @names = map { $_->[0] }
      sort { $b->[1] <=> $a->[1] }
      map { [$_, (stat)[9]] }
      grep { /\d$/ }
      @names;
    splice @names, 10 if @names > 10;
    @names;
  }
  my @newest = get_ten_newest_files("/home/merlyn/Mail/inbox");
  print "$_\n" for @newest;

Here, we create a local directory handle (held in $handle), then open it onto our selected directory. After fetching all the names, I do a Schwartzian Transform (named after me, but not by me, it's a long story) to order them by descending modification times, as well as selecting only the message files (those whose names end in a digit).
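If the map-sort-map pipeline is unfamiliar, this stripped-down sketch shows the same shape without touching the filesystem. The %mtime hash is a hypothetical stand-in for the (stat)[9] lookup, with invented names and times.

```perl
use strict;
use warnings;

# Hypothetical modification times, standing in for (stat $file)[9].
my %mtime = (
  'inbox/1'     => 300,
  'inbox/2'     => 100,
  'inbox/3'     => 200,
  'inbox/notes' => 400,
);

# Read the pipeline from the bottom up.
my @newest_first =
  map  { $_->[0] }                # 4. unwrap: keep just the name
  sort { $b->[1] <=> $a->[1] }    # 3. sort on the saved key, largest first
  map  { [$_, $mtime{$_}] }       # 2. pair each name with its sort key
  grep { /\d$/ }                  # 1. keep only names ending in a digit
  keys %mtime;

print "@newest_first\n";          # inbox/1 inbox/3 inbox/2
```

The payoff is that the expensive key (a stat call in the real code) is computed once per name, not once per comparison.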

Could we have also already opened all those files? I mean, can we stick the filehandles into a data structure and pass it around? Sure enough. Let's make the return value a 10-element list where each element is an arrayref to a two-element array of a filename and its already-opened filehandle. For grins, the filehandle will be already positioned to its body (past the header). So, here goes:

  sub get_ten_newest_files {
    my $dirname = shift;
    my $handle = do { local *X };
    opendir $handle, $dirname or die "$dirname: $!";
    my @names = map "$dirname/$_", readdir $handle;
    @names = map { $_->[0] }
      sort { $b->[1] <=> $a->[1] }
      map { [$_, (stat)[9]] }
      grep { /\d$/ }
      @names;
    splice @names, 10 if @names > 10;
    return map {
      my $name = $_;
      my $fh = do { local *X };
      open $fh, $name or die "Cannot open $name: $!";
      while (<$fh>) {
        last unless /\S/;
      }
      [$name, $fh];
    } @names;
  }
  my @newest = get_ten_newest_files("/home/merlyn/Mail/inbox");
  for (@newest) {
    my ($name, $handle) = @$_;
    print "$name: $_" for <$handle>;
  }

Wow, lots of stuff, but hopefully you can see the meat in the middle. For each name in @names being returned by the subroutine, we transform the name into a two-element array, the second element of which is a brand new filehandle for that filename. The main code pulls out the filenames and the filehandles, and dumps the filehandles, which generates just the bodies.

If I've been following the development direction correctly, the next major release of Perl after 5.005_03 will eliminate the need for all those

  my $x = do { local *X };

steps, by treating any undef variable used by open as if it has a filehandle symbol already installed. Joy! But this trick will continue to work, and isn't really that much additional typing.
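For comparison, Perl 5.6 and later do let open fill in an undefined scalar directly, so the same subroutine can be sketched without the do-block. The version below is an illustration only: it takes a reference to a string instead of a filename (an in-memory open, which assumes Perl 5.8 or later), and the sample message is made up.

```perl
use strict;
use warnings;

sub get_body_handle {
  my $source = shift;    # here: a reference to a string holding a message
  # $in starts out undef; open installs a fresh filehandle in it.
  open my $in, '<', $source or die "cannot open message: $!";
  while (<$in>) {
    last unless /\S/;    # skip the header, stop after the blank line
  }
  return $in;
}

my $message = "Subject: hi\n\nline one\nline two\n";
my $handle  = get_body_handle(\$message);
my @body    = <$handle>;
print "body: $_" for @body;
```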

I hope you've enjoyed this little excursion into filehandle references, and find them part of your bag of Perl tricks. For further information, check the documentation that comes with Perl, especially perlref, as well as chapter 4 of my book Programming Perl, Second Edition from O'Reilly and Associates (co-authored by Larry Wall and Tom Christiansen). Until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.