Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in SysAdmin/PerformanceComputing/UnixReview magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Unix Review Column 45 (Jan 2003)

[suggested title: ``The Duct Tape of the Internet'']

When you're a Perl programmer, you never fret about those little ugly tasks that creep up. Perl can deal with file wrangling, text manipulation, and process management in a way unequaled by any other single language, whether open-source or proprietary.

For example, let's take a simple file and text wrangling task, and see how I solved it with Perl. Having been a system administrator for many years, I'd say that this task is representative of those niggling little things that I would face, typically daily, in the course of my job.

Nearly all Perl modules contain embedded documentation, called ``POD'' (described by perldoc perlpod). When I install a module from the Comprehensive Perl Archive Network (the ``CPAN'': see www.cpan.org for further information), the module is usually installed into a place where my Perl binary can find it (along Perl's @INC path). By default, the installation process also creates an nroff -man page, so that the man command can display a nicely formatted version (presuming you extend your MANPATH or equivalent). Thus, for most modules, you can say either perldoc Some::Module (to convert the embedded POD into text), or man Some::Module (to display the preprocessed man page).
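
As a rough illustration of what ``along Perl's @INC path'' means, the following sketch walks @INC looking for the file that provides a module. The helper name locate_module is made up for this example; perldoc -l Some::Module reports similar information.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the first file along @INC that provides
# the named module, or undef if none does.
sub locate_module {
  my ($module) = @_;
  (my $relative = "$module.pm") =~ s{::}{/}g;   # Some::Module -> Some/Module.pm
  for my $dir (@INC) {
    next if ref $dir;                # skip code-reference hooks in @INC
    my $candidate = File::Spec->catfile($dir, split m{/}, $relative);
    return $candidate if -f $candidate;
  }
  return;
}

my $where = locate_module('File::Find');
print defined $where
  ? "File::Find lives at $where\n"
  : "File::Find not found\n";
```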

However, the server that runs www.stonehenge.com runs OpenBSD (mostly so I can sleep at night knowing that security is a key point of the OpenBSD developers). The default Perl installation of OpenBSD is configured in such a way that the man pages are not generated for non-core Perl modules. I'm expected to type perldoc Some::Module to get the documentation for such a module, rather than my more familiar man Some::Module, even though man still works for the core modules. As I find this rather confusing, I faced two alternatives:

  1. I could hack the core installation of Perl so that it would install man pages, thereby risking breakage if the Perl installation was upgraded during a minor or major release.

  2. I could write a simple tool to take all the embedded POD and generate man pages into my private area.

I decided to write a simple tool, mostly because I'm opposed to touching anything in the core distribution, since I have no idea if someone at OpenBSD headquarters is likely to change things out from under me.

And a simple tool it is, although it's about 80 lines of Perl code. So, looking at a few lines at a time, let's see what I wrote, in about the order that I created the lines.

First, I started with my normal header:

    #!/usr/bin/perl -w
    use strict;
    $|++;

With these three lines, I've turned on warnings, enabled the common compiler restrictions (undeclared variables, soft references, and barewords are all disabled), and turned off the buffering for STDOUT.
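
To see one of those restrictions in action, here is a minimal sketch that compiles a snippet at run time and captures the complaint from strict. The helper strict_error and the variable name $undeclared are invented for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet under strict at run time, so the
# failure can be captured instead of aborting this script.
sub strict_error {
  my ($code) = @_;
  my $ok = eval "use strict; $code; 1";
  return $ok ? '' : $@;
}

# Single quotes keep $undeclared from being interpolated here.
my $error = strict_error('$undeclared = 1');
print "strict said: $error";
```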

Next, I put a few configuration lines that I might change, based on where I was running the program:

    ## BEGIN configuration
    my $MAN3DIR = "/home/merlyn/man/man3";
    my $MAN3EXT = "3p";
    ## END configuration

Here I've defined a location below my home directory where I've placed other personal manpages, and an extension for the specific Perl module pages. Traditionally, Perl modules have the 3p extension and are placed in section 3 of the Unix manual. I've added /home/merlyn/man to my MANPATH, so the man command finds this directory just fine.

    use Pod::Man;
    use File::Find;
    use Config;

Following that, I bring in the three modules (all in the Perl core distribution) that I'll need to wander through the installed directories and find the POD files. The Pod::Man module can convert POD into manpages. The File::Find module recurses through subdirectories. The Config module provides a hash interface to the configuration parameters for the installed Perl. In fact, the next two lines use that hash to locate two specific directories:

    my $SITELIB = $Config{sitelib};
    my $SITEARCH = $Config{sitearch};

The value for $SITELIB gives the path in which local Perl modules are installed. $SITEARCH provides a similar path for architecture-specific modules: those which contain binary files resulting from compiling C (or other languages). Generally, the $SITEARCH directory will be within the $SITELIB directory, and this program presumes that.
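
Since the program presumes that one directory sits inside the other, it may be worth checking that on a given installation. The following is only a sketch; is_within is a hypothetical helper, and the result depends entirely on how the local Perl was built.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: true if $child is $parent itself or lies below it.
sub is_within {
  my ($child, $parent) = @_;
  return 1 if $child eq $parent;
  return index($child, "$parent/") == 0 ? 1 : 0;
}

my ($sitelib, $sitearch) = @Config{qw(sitelib sitearch)};
print "sitelib:  $sitelib\n";
print "sitearch: $sitearch\n";
print is_within($sitearch, $sitelib)
  ? "sitearch is inside sitelib, as the program presumes\n"
  : "sitearch is outside sitelib, so it would need its own find() pass\n";
```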

Next, we'll create a Pod::Man object configured for the task:

    my $podmanparser = Pod::Man->new(section => $MAN3EXT);

The section value gives the name appearing in the page header banner, mostly cosmetic, but nice to get right.
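
To see where the section value ends up, this sketch formats a tiny made-up POD document in memory and prints the .TH line that man uses for the banner. It assumes a Pod::Man new enough to be built on Pod::Simple (so that output_string and parse_string_document are available), which is newer than the version the column was written against. The module name is invented.

```perl
use strict;
use warnings;
use Pod::Man;

# A tiny POD document, built line by line; Some::Module is a made-up name.
my $pod = join "\n",
  '=head1 NAME', '',
  'Some::Module - a made-up module used only for this demonstration', '',
  '=cut', '';

my $parser = Pod::Man->new(section => '3p', name => 'Some::Module');
my $roff = '';
$parser->output_string(\$roff);          # collect the output in a scalar
$parser->parse_string_document($pod);

# The .TH macro carries the page name and section shown in the banner.
my ($title_line) = $roff =~ /^(\.TH\b.*)$/m;
print defined $title_line ? "$title_line\n" : "no .TH line found\n";
```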

Now comes the task of finding the existing POD documentation. So, after a few tries, I came up with the following loop with File::Find:

    my %pods;
    find sub {
      return unless /\.p(m|od)$/;
      my $package = $File::Find::name;
      for ($package) {
        s{^\Q$SITEARCH/}{}
          or s{^\Q$SITELIB/}{}
            or die "Cannot remove $SITEARCH or $SITELIB from $File::Find::name\n";
        s/\.p(m|od)$//
          or die "What happened to the ext in $package?\n";
        s{/}{::}g;
      }
      push @{$pods{$package}}, $File::Find::name;
    }, $SITELIB;

There's a lot going on here, and it's best to work from the outside in. The find subroutine has been imported from File::Find, and is presented with a subroutine reference (here, an anonymous subroutine) and a starting path, $SITELIB. The find routine starts at the top directory, recursing down, calling the subroutine for each found entry (even ones in which we're not interested). The line

      return unless /\.p(m|od)$/;

rejects the filenames that aren't either Perl modules or Perl POD files by looking at $_, which contains the basename (no directory part) of the file or directory being examined. The next few lines extract the package name from the filename into $package. First, we take the full path from $File::Find::name, then remove either the $SITEARCH or $SITELIB prefix from the path. If neither of these succeeds, then something has gone terribly wrong, so we'll abort.

Next, the lines:

        s/\.p(m|od)$//
          or die "What happened to the ext in $package?\n";
        s{/}{::}g;

turn the remainder of the name into a module name, by stripping off the extension and replacing the slashes with double-colon package delimiters. Finally, the loop adds this file name to an arrayref contained within the %pods hash, indexed by the package name. Why a list? Because many modules have a separate POD file, so we'll see both Some/Module.pm and Some/Module.pod. We'll sort out later which of these to use for the manpage, but we'll record them all for now.
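
The path-to-package step can be tried in isolation. The sketch below wraps the same substitutions in a hypothetical function, path_to_package, that returns nothing instead of dying; the directory names in the calls are made up. Note that the architecture-specific prefix has to be tried first, because it is the longer of the two.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the path-to-package step.
sub path_to_package {
  my ($path, $sitearch, $sitelib) = @_;
  my $package = $path;
  for ($package) {
    s{^\Q$sitearch/\E}{} or s{^\Q$sitelib/\E}{}
      or return;                     # path is under neither directory
    s/\.p(?:m|od)$// or return;      # not a .pm or .pod file
    s{/}{::}g;
  }
  return $package;
}

# Made-up directories, for illustration only.
my ($arch, $lib) = ('/usr/lib/site_perl/i386', '/usr/lib/site_perl');
print path_to_package("$arch/Some/Module.pm", $arch, $lib), "\n";  # Some::Module
print path_to_package("$lib/Other/Thing.pod", $arch, $lib), "\n";  # Other::Thing
```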

When this loop has completed, we have a hash %pods, keyed by package name, with each entry being a list of one or more files that may contain the documentation for that module.

When I showed this program to one of my friends, they then commented (only after I toiled over this part), ``Why didn't you just use Pod::Find?'' Ah, yes. If I'd only known, I could have reduced this part of the program to a few lines of code. I'll have to file that away for use in a future program. The lesson here is ``always check the CPAN first, because a solution to any interesting task has likely already been written''.
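
For readers who want to try the shortcut today: Pod::Find belongs to the Pod-Parser distribution, which left the Perl core in 5.32, so the sketch below swaps in Pod::Simple::Search, a different core module that does a similar survey. This is not the code the column describes, and survey_pods is a hypothetical wrapper name.

```perl
use strict;
use warnings;
use Config;
use Pod::Simple::Search;

# Hypothetical wrapper: map package names to the files holding their POD,
# looking only below the given directories.
sub survey_pods {
  my (@dirs) = @_;
  return {} unless @dirs;
  my $search = Pod::Simple::Search->new;
  $search->inc(0);                   # search only @dirs, not all of @INC
  my $name2path = $search->survey(@dirs);
  return $name2path || {};
}

my @dirs = grep { defined $_ && -d $_ } @Config{qw(sitearch sitelib)};
my $found = survey_pods(@dirs);
print "$_ => $found->{$_}\n" for sort keys %$found;
```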

The next step is to wander through the hash, and do whatever it takes to update the manpages if needed. We'll start with a loop like this:

    POD: for my $pod (sort keys %pods) {
      my @files = @{$pods{$pod}};
      ... more code here ...
    }

I had to name the loop because we'll see a point later where I want to execute a next against this loop even though I'm in a nested loop. So, $pod contains a package name, and @files contains one or more source files for that package. Next, we need to figure out which one of many source files is needed if there's more than one:
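
Here is a small, self-contained illustration of that labeled next, using made-up data rather than the real %pods hash. The label ITEM plays the role that POD plays in the program.

```perl
use strict;
use warnings;

# Made-up data: each name maps to a list of words to scan.
my %candidates = (
  alpha => [qw(skip skip)],
  beta  => [qw(skip hit skip)],
  gamma => [qw(hit)],
);

my @handled;
ITEM: for my $name (sort keys %candidates) {
  for my $word (@{ $candidates{$name} }) {
    if ($word eq 'hit') {
      push @handled, $name;
      next ITEM;        # leave the inner loop and advance the outer one
    }
  }
  print "$name: nothing to do\n";    # reached only if no 'hit' was seen
}
print "handled: @handled\n";         # handled: beta gamma
```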

      if (@files > 1) {         # more than one?  must sort
        @files = sort {
          ## primary: prefer arch-specific over non-arch-specific
          to_boolean($b =~ m{^\Q$SITEARCH}) <=> to_boolean($a =~ m{^\Q$SITEARCH})
            ## secondary: prefer .pod to .pm
            or to_boolean($b =~ /\.pod$/) <=> to_boolean($a =~ /\.pod$/);
        } @files;
      }
      my $file = shift @files;  # first one is always best now

Again, a lot of stuff going on here. If there's more than one file, we'll sort them, preferring architecture-specific files over generic files, and .pod files over .pm files. The first entry in the list after sorting (or the only entry in the list if there was only one to start with) is now the most likely candidate for our manpage.

The to_boolean routine maps any false value to 0 and any true value to 1 so that we can sort nicely:

    sub to_boolean {
      $_[0] ? 1 : 0;
    }
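
To watch the two-level comparison at work, the sketch below wraps it in a hypothetical function, best_first, and feeds it three made-up paths. It is a restatement of the sort above for experimentation, not an extra piece of the program.

```perl
use strict;
use warnings;

sub to_boolean { $_[0] ? 1 : 0 }

# Hypothetical wrapper around the comparison shown above.
sub best_first {
  my ($sitearch, @files) = @_;
  my @sorted = sort {
    ## primary: arch-specific first
    to_boolean($b =~ m{^\Q$sitearch\E/}) <=> to_boolean($a =~ m{^\Q$sitearch\E/})
      ## secondary: .pod before .pm
      or to_boolean($b =~ /\.pod$/) <=> to_boolean($a =~ /\.pod$/)
  } @files;
  return @sorted;
}

# Made-up paths, for illustration only.
my @ordered = best_first('/site/arch',
  '/site/Some/Module.pm',
  '/site/arch/Some/Module.pm',
  '/site/Some/Module.pod',
);
print "$_\n" for @ordered;
# /site/arch/Some/Module.pm
# /site/Some/Module.pod
# /site/Some/Module.pm
```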

Next, we'll figure out the name of the manfile, and whether or not we have any work to do:

      my $manfile = "$MAN3DIR/$pod.$MAN3EXT";
      next if
        -e $manfile and
          -M $manfile < -M $file;       # skip if exists and newer

If the manpage file exists, and is newer than our source file, we've got nothing to do, so we go on to the next entry.
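
The same freshness test can be written as a small predicate. In this sketch, needs_update is a hypothetical helper and the target path in the call is made up; the logic mirrors the condition above, regenerating when the target is missing or is not strictly newer than the source.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $target is missing or not newer than $source.
# -M gives a file's age in days relative to script start, so smaller is newer.
sub needs_update {
  my ($source, $target) = @_;
  return 1 unless -e $target;
  my $target_age = -M $target;
  my $source_age = -M $source;
  return $target_age >= $source_age ? 1 : 0;
}

# This script is the source; the target is a made-up path that should not exist.
print needs_update($0, '/no/such/dir/Some::Module.3p')
  ? "would regenerate\n"
  : "up to date\n";
```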

At this point, we have a source file (either POD or Perl file) which has not yet been updated into a manpage. However, the file may still contain no POD directives. We need to look for some POD in the file. The easiest way is to look for =head at the beginning of a line. This isn't entirely accurate, but it's the same rule that the perldoc command uses, so I figure it's close enough. And that code came out like this (after a few tries):

      open IN, $file
        or warn("Cannot open $file, skipping: $!\n"), next POD;
      while (<IN>) {
        if (/^=head/) {         # POD sign!
          print "pod2man $file $manfile\n";
          not -e $manfile or unlink $manfile
            or warn("Cannot remove $manfile: $!\n");
          open OUT, ">$manfile"
            or warn("Cannot create $manfile: $!\n"), next POD;
          seek IN, 0, 0;
          $podmanparser->parse_from_filehandle(\*IN, \*OUT);
          close OUT;
          next POD;
        }
      }

The meat is in the middle: once we've determined we have a decent POD file, we seek the file back to the beginning, and then call parse_from_filehandle to generate the manpage.
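
The ``does this file contain POD?'' check can also be pulled out on its own. The sketch below is a variation, not the column's code: has_pod is a hypothetical helper, and it uses a lexical filehandle with three-argument open, which closes itself when the function returns.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any line of $file starts with "=head",
# the same rough test applied above before converting.
sub has_pod {
  my ($file) = @_;
  open my $in, '<', $file or return 0;
  while (my $line = <$in>) {
    return 1 if $line =~ /^=head/;
  }
  return 0;
}

# strict.pm is loaded above, so %INC records where it lives.
my $example = $INC{'strict.pm'};
print "$example ", (has_pod($example) ? "contains" : "does not contain"), " POD\n";
```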

So, any time I suspect that there's been a new module added to my local install, I can run this program, and my local manpage collection is updated, with minimal effort.

A simple task, simply executed in Perl, but it solves a real problem: it lets me get at Perl's documentation with either perldoc or man, working around a vendor limitation. Most of those ``gotta get it done now with no time to do it'' system administration tasks seem to be about this large, and as you can see, Perl fits the task nicely. So, until next time, enjoy!


Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.