Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Linux Magazine Column 18 (Nov 2000)

[Suggested title: Discovering incomprehensible documentation]

Ahh, manpages. Some of them are great. But a few of them are just, well, incomprehensible.

So I was sitting back a few days ago, wondering if there was a way to locate the ugly ones for some sort of award. Then I remembered that I had seen a neat module called Lingua::EN::Fathom, which can compute various statistics about a chunk of text or a file, including relative readability indices such as the ``Fog'' index. The ``Fog'' index is particularly interesting because it was originally calibrated to indicate ``grade level'', with 1.0 being ``first grade text'' and 12.0 being ``high school senior''. At least, that's the way I remember it.
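
For the curious, the Fog score is usually given as 0.4 times the sum of the average sentence length and the percentage of ``complex'' words (three or more syllables). Here's a rough sketch of that arithmetic in plain Perl. The sentence splitter and the vowel-group syllable counter are crude assumptions for illustration, not what Lingua::EN::Fathom does internally, so expect the numbers to differ from the module's.

```perl
use strict;
use warnings;

# Rough syllable estimate: count runs of vowels. This is an assumption
# made for the sketch, not the module's real syllable counter.
sub rough_syllables {
  my $word = lc shift;
  my $count = () = $word =~ /[aeiouy]+/g;
  return $count || 1;
}

# Fog = 0.4 * (words per sentence + percentage of complex words)
sub rough_fog {
  my $text = shift;
  my @sentences = grep /\w/, split /[.!?]+/, $text;
  my @words = $text =~ /[A-Za-z']+/g;
  return 0 unless @sentences and @words;
  my $complex = grep { rough_syllables($_) >= 3 } @words;
  return 0.4 * (@words / @sentences + 100 * $complex / @words);
}

printf "%.3f\n", rough_fog("The cat sat. The dog ran.");
```

Two three-word sentences with no long words give 0.4 * 3, so short sentences and short words keep the score low.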

While I don't believe in the irrational religion sometimes applied to these indices (``We shall have no documentation with a Fog factor higher than 10.0''), I do think a high score is an indicator that something is amiss.

So, in an hour or so, I hacked out a program that wanders through all of the manpages in my MANPATH, extracts the text information (discarding the troff markup), and sorts them by Fog index for my amusement. Since I brought together a lot of different technologies and CPAN modules to do it, I thought I'd share this program with you, found in [listing one, below].

Line 1 gives the path to Perl, along with turning on warnings. I usually don't trigger any warnings, but it's nice to run with the safeties on occasionally.

Line 2 enables the normal compiler restrictions, requiring me to declare all variables and quote all strings (rather than using barewords), and preventing me from using those hard-to-debug symbolic references.

Line 3 ensures that each print operation on STDOUT results in an immediate I/O operation. Normally, we'd like STDOUT to be buffered to minimize the number of system calls, but since this program produces a trace of what's happening, I would kinda like to know that it's happening while it is happening, not after we got 8000 bytes to throw from a buffer. As an aside, some people would prefer that I use $| = 1; here because it would be clearer. But I find the $|++ form easier to type, and I saw Larry do it once, so it must be blessed.

Line 6 provides the only configuration variable for this program: the location of the memory to be used between invocations. Running the text analysis on the data each time is expensive (especially while I was testing the report generator at the bottom of the program), so I'm keeping a file in my home directory to hold the results. The filename will have an extension appended to it, depending on the chosen DBM library.

Line 9 is what got me started on this program: a module from the CPAN to compute readability scores. As this is not part of the standard distribution, you'll need to install this yourself.

Line 10 provides the two constants I needed for the later DBM association.

Line 11 pulls in the ``multi-level database'' adaptor. MLDBM wraps the fetch and store routines for a DBM tied hash so that any reference (meaning a data structure) is first ``serialized''. The result is that a complex data structure is turned into a simple byte string during storage, and when retrieved, the reverse occurs, so that we get a similar data structure again. There are interesting limitations, but none of them got in my way for this program.

The args to the use indicate that we want to use DB_File as our DBM, and Storable as our serializer. DB_File is found in the Perl distribution, but you must have installed ``Berkeley DB'' before building Perl for this to be useful. Replace that with SDBM if you can't find DB_File. Storable is also found in the CPAN, and is my preferred serializer for its robustness and speed. Data::Dumper can also be used here, with the advantage that it's the default.
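
To see roughly what MLDBM is doing on our behalf, here's a hand-rolled sketch of the same round trip: freeze the structure into a byte string on the way in, thaw it on the way out. It uses SDBM_File and Storable so it runs without Berkeley DB; the temporary directory, the sample key, and the sample values are all made up for the demonstration.

```perl
use strict;
use warnings;
use Fcntl qw(O_CREAT O_RDWR);
use SDBM_File;
use Storable qw(freeze thaw);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);          # scratch location for the demo
tie my %raw, 'SDBM_File', "$dir/demo", O_CREAT|O_RDWR, 0644
  or die "Cannot tie: $!";

my %info = (mtime => 973000000, fog => 12.5);     # made-up sample values
$raw{"/usr/man/man1/demo.1"} = freeze(\%info);    # structure -> byte string

my $copy = thaw($raw{"/usr/man/man1/demo.1"});    # byte string -> structure
print "fog is $copy->{fog}\n";

untie %raw;
```

Note that $copy is a fresh structure with the same contents, not the original hash. That's what ``similar data structure'' means above.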

Line 12 selects the ever-popular File::Find module (included in the distribution) to recurse downward through the man directories to scout out the manpage files.

Line 13 enables simple trapping of signals with a trivial die operation. I found that without this, if I pressed control-C too early in the process, none of my database had been updated, which makes sense after I thought about it. (An interrupt stops everything, not even giving Perl a chance to write the data to the file by closing the database cleanly.)
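
Here's a small sketch of the difference that makes, assuming a Unix-like system where a process can signal itself. With the sigtrap handler installed, the interrupt turns into an ordinary die, which means eval can catch it and, more to the point, normal cleanup still happens on the way out.

```perl
use strict;
use warnings;
use sigtrap qw(die normal-signals);

# The interrupt arrives as an exception instead of killing us outright.
my $caught = '';
eval {
  kill 'INT', $$;     # simulate pressing control-C
  sleep 1;            # give the signal a moment to be dispatched
  1;
} or $caught = $@;

print "still alive after: $caught";
```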

Line 15 associates %DB with the multilevel database named in $DATAFILE. The remaining parameters are passed to the underlying DB_File tie operation. They request that the database be created if it doesn't exist yet, and give the permissions to use for a newly created file.

Line 17 sets up a global @manpages variable to hold the manpages found by the subroutine in lines 20 through 23.

Lines 19 through 24 walk through the directories named in my MANPATH, looking for manpages. First, the MANPATH is split on colons, then each element is suffixed with slash-period-slash. As far as File::Find is concerned, this doesn't change the starting directories, but the presence of this marker is needed later to distinguish the prefix directory from the location within that directory, as we'll see in line 29.

The anonymous subroutine starting in line 19 is called repeatedly by File::Find's find routine. The full name of each file can be found in $File::Find::name, while $_ is set up properly together with the current directory to perform file stat tests. The conditions I'm using here declare that we're looking for a plain file (not a symbolic link) that isn't named whatis, and that is neither too small (under 80 bytes) nor too big (over 16384 bytes). If it's a go, the name gets stuffed at the end of @manpages.
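
If you'd like to watch that filter work without touching your real MANPATH, here's a sketch that builds a tiny throwaway tree and walks it the same way. The file names and sizes are invented for the demonstration; the tests in the callback mirror the ones described above, with $_ written out explicitly.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a throwaway tree: one keeper, one too-small file, one whatis.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/man1" or die "mkdir: $!";
my %sizes = ("man1/good.1" => 200, "man1/tiny.1" => 10, "whatis" => 200);
for my $rel (keys %sizes) {
  open my $out, '>', "$top/$rel" or die "open $rel: $!";
  print $out "x" x $sizes{$rel};
  close $out;
}

my @found;
find sub {
  return unless -f $_ and not -l $_ and $_ ne "whatis";
  my $size = -s $_;
  return if $size < 80 or $size > 16384;
  push @found, $File::Find::name;
}, "$top/./";

print "$_\n" for @found;
```

Only good.1 should survive, and its reported name carries the slash-dot-slash marker between the starting directory and the rest of the path.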

Line 26 creates the text analyser object. I humored myself at the time by calling it $fat, which originally was a shortened form of ``fathom''. As I write this text the next day, I can't remember why I found that funny. I guess it's meta-funny.

And now for the first big loop, in lines 28 to 48. This is where we've got the list of manpages, and it's time to go see just how awful they are.

Line 29 pulls apart the $dir, which is the original element of my MANPATH, from the $file, which is the path below that directory. This is possible because we included the slash-dot-slash marker in the middle of the path during the filename searching. It's necessary because the troff commands of the manpages presume that the current directory is at the top of the manpage tree during processing, particularly for the .so command, which can bring in another manpage like an include file.

Line 30 refixes the name to avoid the marker, and line 31 shows us our progress with that updated name.
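
As a quick illustration, here's the same match applied to a single sample name (the path is just an example of what find would hand back):

```perl
use strict;
use warnings;

my $name = "/usr/man/./man1/tar.1";    # sample name, marker included

# Non-greedy prefix up to the marker, then everything after it.
my ($dir, $file) = $name =~ m{(.*?)/\./(.*)}s
  or die "no marker in $name";

print "dir:  $dir\n";
print "file: $file\n";
print "name: $dir/$file\n";
```

The non-greedy (.*?) stops at the first marker, so $dir gets the MANPATH element and $file gets everything below it.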

Lines 32 through 36 keep us from rescanning the same file. First, the modification timestamp is grabbed into $mtime. Next, we check the existing database entry (if any) to see if the recorded modification time from a previous run is the same as the modification time we've just seen. If they're the same, we've already done this file on some prior run, and we can skip this one altogether. If not, we gotta get our hands dirty on it instead.
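
The skip logic boils down to a two-part question, sketched here against an ordinary in-memory hash standing in for %DB. The needs_scan name, the sample path, and the timestamps are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# %cache stands in for the tied %DB; the entry is made up.
my %cache = ("/usr/man/man1/tar.1" => { mtime => 973000000, fog => 45.591 });

sub needs_scan {
  my ($cache, $name, $mtime) = @_;
  return 1 unless exists $cache->{$name};     # never seen before
  return $cache->{$name}{mtime} != $mtime;    # seen, but changed since
}

for my $mtime (973000000, 973000999) {
  printf "%d: %s\n", $mtime,
    needs_scan(\%cache, "/usr/man/man1/tar.1", $mtime) ? "rescan" : "skip";
}
```

Checking exists first matters: it keeps us from asking for the mtime of an entry that isn't there.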

Line 38 is where this program spends most of its time. We have a deroff command that reads a troff file, and removes most of the troff embedded control sequences and commands. While it's not perfect, it's fairly useful, and close enough for this demonstration. And we need to be in that parent directory so that relative filenames work; that's handled with the simple one-liner shell command inside the backquotes.
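
One caution about the backquotes: the directory and file names pass through the shell, so a name containing spaces or shell metacharacters could misbehave. Here's a sketch of a shell-free alternative, assuming Perl 5.8 or later for the list form of a piped open. The output_in_dir helper is hypothetical, not part of the listing, and the demonstration runs a Perl one-liner rather than deroff so that it works anywhere.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);

# Run @cmd with $dir as its working directory and return its output,
# without handing anything to the shell.
sub output_in_dir {
  my ($dir, @cmd) = @_;
  my $old = getcwd();
  chdir $dir or die "Cannot chdir to $dir: $!";
  my $ok  = open my $pipe, '-|', @cmd;   # child inherits the new directory
  my $err = $!;
  chdir $old or die "Cannot chdir back to $old: $!";
  die "Cannot run $cmd[0]: $err" unless $ok;
  local $/;                              # slurp all the output at once
  my $text = <$pipe>;
  close $pipe;                           # sets $?, as backquotes would
  return $text;
}

print output_in_dir('.', $^X, '-e', 'print "hello\n"');
```

In the program, the equivalent call would be something like output_in_dir($dir, 'deroff', $file), followed by the same check of $?.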

``But wait'', you may ask, ``I don't have a deroff!'' Never fear. I ran into the same problem myself. A quick search on the net (thank you, www.google.com!) revealed that this had been one of the already completed commands in the Perl Power Tools project, archived at http://language.perl.com/ppt/. So, I downloaded the .tar.gz file from that page, extracted the pure Perl implementation of deroff, and installed it quite nicely. Yes, there's a few open-source C versions out there, but I didn't want to futz around.

Line 39 detects a failure in the attempt to deroff the text, and moves along if something broke. Nothing went wrong in the hundreds of files I analyzed, but ya never know.

Line 40 is where this program does some heavy CPU on its own. The text of the deroff'ed manpage is crunched, looking for the various statistics, including our readability scores. There didn't appear to be any error return possible from this call, so I didn't try to detect one.

Line 42 creates the %info data structure to hold the attributes of this particular file that we want to store into the database. We'll start with the modification time that we fetched earlier, to ensure that later passes will go ``hey, I've already seen this version''.

Lines 43 through 45 use the $fat object to access the three scores, via the fog, flesch, and kincaid methods. I've used a nice trick here: an ``indirect method call'', where the name of the method comes from a variable. The result is as if I had said:

  $info{"fog"} = $fat->fog();
  $info{"flesch"} = $fat->flesch();
  $info{"kincaid"} = $fat->kincaid();

But with a lot less typing. (That is, until just now, to illustrate the point.)

Line 46 stores the information into the database. The value is a reference to the hash, but the MLDBM triggers a serialization, so that the actual DBM value stored is just a byte string that can be reconverted into a similar data structure upon access. And in fact, an access already potentially occurred up in line 33. The access to $DB{$name} fetched a byte string from the disk database, which was then reconverted into a hashref so that the subsequent access to the hashref element with a key of mtime would succeed.

Line 47 lets us know we did the deed for this file, and are moving on.

And that completes the data gathering phase, so it's now time to do the report, as indicated in line 50.

Line 54 is a quick trick with interesting performance consequences. The hash of %DB acts like a normal hash, but actually involves two levels of tied data structures. This can be quite slow, especially when performing repeated accesses for sorting. So, we copy the entire database into an in-memory hash, %db, in one brief operation. Now we can use %db in the same way we would have used %DB, but without the same access expense. Of course, since it's a copy, we can't change the real database, but that's not needed here.

Lines 55 to 57 sort the database by the key specified in $kind, defined in line 52. We've got a descending numeric sort, to put the worst offenders first. A simple printf makes the columns line up nicely.
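
Here's the report step in isolation, run against a few made-up scores so you can see the ordering. The ranked helper and the sample data are hypothetical; the sort block is the same idea as in the listing.

```perl
use strict;
use warnings;

# Made-up scores standing in for the in-memory copy of the database.
my %db = (
  "/usr/man/man1/alpha.1" => { fog => 7.2,  flesch => 70.1 },
  "/usr/man/man1/beta.1"  => { fog => 45.6, flesch => 20.3 },
  "/usr/man/man1/gamma.1" => { fog => 12.0, flesch => 55.0 },
);

# Page names in descending order of the chosen score.
sub ranked {
  my ($db, $kind) = @_;
  return sort { $db->{$b}{$kind} <=> $db->{$a}{$kind} } keys %$db;
}

printf "%10.3f %s\n", $db{$_}{fog}, $_ for ranked(\%db, "fog");
```

One thing to watch if you change $kind: a higher Flesch reading-ease score means easier text, so a descending sort on flesch puts the worst offenders last rather than first.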

And my output from running this program looks something like [listing two, below]. Yeah, that first file supposedly takes a whopping ``grade 167'' education to read. In theory. The simplest few come in at about 5th or 6th grade. As a comparison, the text of this column (before editing) came out at around 13.3 on the fog index. Hmm. I hope you all made it through high school! Until next time, keep your sentences short, and to the point. Enjoy!

Listings

        =0=     ##### LISTING ONE #####
        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     ## config
        =6=     my $DATAFILE = "/home/merlyn/.manfog";
        =7=     ## end config
        =8=     
        =9=     use Lingua::EN::Fathom;
        =10=    use Fcntl qw(O_CREAT O_RDWR);
        =11=    use MLDBM qw(DB_File Storable);
        =12=    use File::Find;
        =13=    use sigtrap qw(die normal-signals);
        =14=    
        =15=    tie my %DB, 'MLDBM', $DATAFILE, O_CREAT|O_RDWR, 0644 or die "Cannot tie: $!";
        =16=    
        =17=    my @manpages;
        =18=    
        =19=    find sub {
        =20=      return unless -f and not -l and $_ ne "whatis";
        =21=      my $size = -s;
        =22=      return if $size < 80 or $size > 16384;
        =23=      push @manpages, $File::Find::name;
        =24=    }, map "$_/./", split /:/, $ENV{MANPATH};
        =25=    
        =26=    my $fat = Lingua::EN::Fathom->new;
        =27=    
        =28=    for my $name (@manpages) {
        =29=      next unless my ($dir, $file) = $name =~ m{(.*?)/\./(.*)}s;
        =30=      $name = "$dir/$file";
        =31=      print "$name ==> ";
        =32=      my $mtime = (stat $name)[9];
        =33=      if (exists $DB{$name} and $DB{$name}{mtime} == $mtime) {
        =34=        print "... already computed\n";
        =35=        next;
        =36=      }
        =37=    
        =38=      my $text = `cd $dir && deroff $file`;
        =39=      (print "cannot deroff: exit status $?\n"), next if $?;
        =40=      $fat->analyse_block($text);
        =41=    
        =42=      my %info = ( mtime => $mtime );
        =43=      for my $meth (qw(fog flesch kincaid)) {
        =44=        $info{$meth} = $fat->$meth();
        =45=      }
        =46=      $DB{$name} = \%info;
        =47=      print "... done\n";
        =48=    }
        =49=    
        =50=    print "final report:\n\n";
        =51=    
        =52=    my $kind = "fog";
        =53=    
        =54=    my %db = %DB;                   # copy into memory for speed
        =55=    for my $page (sort { $db{$b}{$kind} <=> $db{$a}{$kind} } keys %db) {
        =56=      printf "%10.3f %s\n", $db{$page}{$kind}, $page;
        =57=    }

        =0=     ##### LISTING TWO #####
        =1=     final report:
        =2=     
        =3=        167.341 /usr/lib/perl5/5.00503/man/man3/WWW::Search::Euroseek.3
        =4=        154.020 /usr/lib/perl5/5.00503/man/man3/GTop.3
        =5=         65.528 /usr/lib/perl5/5.00503/man/man3/Tk::X.3
        =6=         56.616 /usr/man/man1/mh-chart.1
        =7=         45.591 /usr/man/man1/tar.1
        =8=         40.133 /usr/lib/perl5/5.00503/man/man3/Bio::SeqFeatureI.3
        =9=         39.012 /usr/lib/perl5/5.00503/man/man3/XML::BMEcat.3
        =10=        37.714 /usr/lib/perl5/5.00503/man/man3/less.3
        =11=        37.200 /usr/lib/perl5/5.00503/man/man3/Business::UPC.3
        =12=        36.809 /usr/lib/perl5/5.00503/man/man3/Number::Spell.3
        =13=    [...many lines omitted...]
        =14=         7.179 /usr/man/man1/tiffsplit.1
        =15=         7.174 /usr/lib/perl5/5.00503/man/man3/Tie::NetAddr::IP.3
        =16=         7.018 /usr/lib/perl5/5.00503/man/man3/DaCart.3
        =17=         6.957 /usr/man/man3/form_driver.3x
        =18=         6.899 /usr/man/man7/samba.7
        =19=         6.814 /usr/lib/perl5/5.00503/man/man3/Array::Reform.3
        =20=         6.740 /usr/lib/perl5/5.00503/man/man3/Net::GrpNetworks.3
        =21=         6.314 /usr/man/man5/rcsfile.5
        =22=         6.210 /usr/lib/perl5/5.00503/man/man3/Network::IPv4Addr.3
        =23=         6.002 /usr/lib/perl5/5.00503/man/man3/Net::IPv4Addr.3
        =24=         5.881 /usr/man/man8/kbdrate.8
        =25=         5.130 /usr/lib/perl5/5.00503/man/man3/Net::Netmask.3

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.