Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 43 (Dec 2002)

[suggested title: Browsing a local CPAN mirror]

Last month, I showed how to fetch a subset of the CPAN (Comprehensive Perl Archive Network) to create a mini-mirror. The subset included just the latest distribution of each module, plus the index files so that the CPAN.pm module could install and update your local modules.

One use of a mini-mirror is to install CPAN modules when I'm disconnected from the net (like when I'm on a cruise ship for www.geekcruises.com or at 30,000 feet jetting to another Perl training site). But I often find myself just browsing through the recent additions to the CPAN to see what's new, what's cool, and what's being updated. This is easy to do online, because search.cpan.org provides a ``recent additions'' link. But offline, the data is much less readily available, since the RECENT file shows only a few days of past activity. Worse yet, my mini-mirror doesn't download either the RECENT file or any of the README extractions for the distributions.

So, I started wondering if there was a way I could use just the mini-CPAN created by last month's column and still browse the newest distributions, and, even better, dump out the README files for those distributions if they existed. Surely, the information was there in the form of the mirrored timestamps on the distributions themselves. And the README files, while not extracted, were certainly present inside the tar.gz files. And that led me to the program presented in [the listing below].

First off, lines 1-3 start nearly every Perl program I write, enabling warnings, compile-time restrictions, and unbuffering STDOUT.

Line 7 is the only configuration parameter of this script. It needs to be a Unix path to the top of the CPAN mirror (with subdirectories authors and modules immediately within this directory). You can use either a full CPAN mirror, or the mini-CPAN mirror created with last month's column.

Lines 11 to 19 pull in the various modules needed for this program. From the core distribution, we need the catfile and devnull functions (from File::Spec::Functions) and the Safe module (reasons explained later). From the Compress::Zlib module (found in the CPAN), we'll get the gzopen function and the $gzerrno variable. And finally, we'll pull in the Archive::Tar module, also found in the CPAN.

The main program lives in lines 21 to 34. We're going to page the output by days. The first output will be everything that was uploaded within the past 24 hours, and then the program pauses, waiting for me to hit return, after which the second day of output is shown, and so on. The $days_ago variable defined in line 21 manages this part of the process.

Lines 22 to 32 loop over each ``distro'', consisting of a hash reference to the particulars of a given distribution. We'll do this by calling a subroutine (defined later) in line 22 to get all the distros as a big list, sorted by age from newest to oldest.

The modification age in days of each distro is computed in line 25 and compared to the current upper boundary. If we've gone past it, we'll prompt in line 26 and wait for a return in line 27 (discarding the actual input). The heading printed in line 28 lets us know how old we've gotten.
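The paging test is plain integer arithmetic on epoch seconds. Here is a minimal sketch of the same idea, assuming a hypothetical helper named age_in_days (it is not part of the listing) and made-up timestamps. It uses a while loop where the listing uses unless, so the heading catches up when several days pass with no uploads:

```perl
use strict;
use warnings;

# Hypothetical helper (not in the listing): whole days elapsed between
# a modification time and "now", both given in epoch seconds.
sub age_in_days {
  my ($modtime, $now) = @_;
  return int(($now - $modtime) / 86400);
}

my $now      = time;
my $days_ago = 0;

# Made-up modification times: one hour, 25 hours, and four days back
for my $modtime ($now - 3600, $now - 90_000, $now - 4 * 86400) {
  # "while" rather than "unless": steps over days with no items at all
  while (age_in_days($modtime, $now) > $days_ago) {
    print ++$days_ago, " days ago:\n";
  }
  print "  item modified ", $now - $modtime, " seconds back\n";
}
```

The listing's single unless bumps the counter by one per distro, which is fine as long as every day in the mirror has at least one upload.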

Of course, once we've finished hassling around with the date stamping, the next step is to do the real work, triggered from line 31. We'll see that subroutine defined later.

Lines 36 to 59 handle the scanning of the existing distros, to figure out the order of the distros from most recent to oldest. Lines 37 and 38 declare a ``process this only once'' hash and the array to hold the list of resulting distros.

Line 40 computes the ``package details'' pathname, as a native filepath. Line 41 calls a subroutine with this pathname to fetch the file, uncompress it, and return a list of all the lines of the file after the first blank line. Each line represents one package name. Line 42 splits the whitespace-delimited line into the interesting parts. Line 43 skips over anything that looks like a distribution of Perl itself, and line 45 rejects any distros we've seen already, to ensure one pass each.
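The split-and-skip logic can be tried on its own without a mirror. This is a sketch under stated assumptions: the sample lines are invented in the style of the package details file, and unique_distro_paths is a hypothetical name, not something from the listing:

```perl
use strict;
use warnings;

# Invented sample lines: package name, version, distribution path
my @lines = (
  "Foo::Bar        1.02  A/AU/AUTHOR/Foo-Bar-1.02.tar.gz\n",
  "Foo::Bar::Util  0.01  A/AU/AUTHOR/Foo-Bar-1.02.tar.gz\n",
  "strict          1.02  X/XX/XXX/perl-5.8.0.tar.gz\n",
);

# Hypothetical helper: return each distribution path once, in first-seen order
sub unique_distro_paths {
  my %seen;
  my @paths;
  for (@_) {
    my ($module, $version, $path) = split;  # split $_ on whitespace
    next if $path =~ m{/perl-5};            # skip distributions of Perl itself
    next if $seen{$path}++;                 # true on second and later sightings
    push @paths, $path;
  }
  return @paths;
}

print "$_\n" for unique_distro_paths(@lines);
```

The post-increment in $seen{$path}++ returns the old count, so the test is false only the first time a path shows up.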

Line 48 computes a native filename for the CPAN-mirror location of this particular distro. Note that the indices always show Unix-style forward slashes, but we'll pass separate elements to catfile so that it just does the right thing to construct a full path.
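A small sketch of that path construction, assuming a made-up mirror root and index path, with local_file_for as a hypothetical wrapper name:

```perl
use strict;
use warnings;
use File::Spec::Functions qw(catfile);

# Made-up values, for illustration only
my $root = "/tmp/minicpan";
my $path = "A/AU/AUTHOR/Foo-Bar-1.02.tar.gz";

# Hypothetical wrapper: break the slash-separated index path into
# elements and let catfile join them with the native separator.
sub local_file_for {
  my ($root, $path) = @_;
  return catfile($root, split m{/}, "authors/id/$path");
}

print local_file_for($root, $path), "\n";
```

On a Unix box the result simply has the slashes back in place; the point of going through catfile is that the same code yields a native path elsewhere.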

Line 50 constructs a distro record for this distro, as a new anonymous hash (reference) pushed onto the @distros array. We'll note the filename of the CPAN mirror file, the identification path, the module name (or at least one of the modules of this distribution), and the modification time as an internal timestamp (the usual ``seconds since the Unix epoch'' value).

Line 58 does a simple sort-block sort to bring out all the distros ordered by their modification timestamp order. Note that $b appears before $a, so we get a descending sort, resulting in newest first, just as we promised.
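As a sketch of that sort on invented records (newest_first is a hypothetical name), with one addition the listing does not make: entries whose modtime is undef are dropped first, since stat returns an empty list for a file that is missing from the mirror.

```perl
use strict;
use warnings;

# Invented records in the shape built by the listing
my @distros = (
  { path => "old.tar.gz",     modtime => 1_000_000_000 },
  { path => "missing.tar.gz", modtime => undef },  # as if stat had failed
  { path => "new.tar.gz",     modtime => 1_040_000_000 },
);

# Hypothetical helper: descending numeric sort, skipping undef modtimes
sub newest_first {
  return sort { $b->{modtime} <=> $a->{modtime} }
         grep { defined $_->{modtime} } @_;
}

print "$_->{path}\n" for newest_first(@distros);
```

Without the grep, comparing an undef modtime would still sort (undef counts as zero) but would trigger an uninitialized-value warning under -w.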

Lines 61 to 80 take care of showing each chosen distro, which is passed as a parameter and shifted off the @_ array in line 62.

Line 64 fetches the module information data, calling a subroutine to return the value. On the first call, this subroutine will do a lot of work, but the subroutine caches the value for subsequent calls. The format of the response is a reference to a hash keyed by module name, each value being a hash of that module's attributes.

Line 65 opens up the distro file, presuming it is openable with the Archive::Tar module, which includes both ordinary tar files and compressed tar files. The resulting Archive::Tar object can then be queried and extracted.

Line 67 fetches the module description of the current distro, always a short phrase if it's present. Since some modules have no description, we'll ``or'' this with an empty string to keep from getting an uninitialized-value warning later, particularly in line 68, which displays the short path for the distro and its description.

Line 70 queries the Archive::Tar object for all of the contained files that look like a README file. These usually have some cool information in them, and that's what we're looking for to see how cool the module can be.
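The README filter can be checked against a plain list of names, with no tarball needed. A sketch with invented member names:

```perl
use strict;
use warnings;

# Invented archive member names
my @files = qw(
  Foo-1.0/README
  Foo-1.0/lib/Foo.pm
  Foo-1.0/README.txt
  Foo-1.0/t/README
  README
);

# Same pattern as the listing: a path component named exactly README
my @readmes = sort grep m{/README\z}, @files;

print "$_\n" for @readmes;
```

Note that the pattern requires a leading slash, so README.txt and a bare top-level README (in an archive with no enclosing directory) are both passed over.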

Lines 72 to 78 display each README file. The name is prefixed with a pipe-space. The content is then fetched in line 74, and each line is printed with two pipe-spaces in front of it. Regardless of whether the README file uses Mac, Unix, or DOS line endings, and whether we're running on any of those architectures, this code pretty much does the right thing. Note the use of \cM and \cJ instead of \r and \n, because what \r and \n mean depends on the platform (on the classic Mac OS, for one, they are swapped around). See perldoc perlport for all the gruesome details.
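Here is the same regular expression wrapped in a hypothetical split_lines helper, as a sketch for experimenting with mixed line endings. One detail worth knowing: the pattern can match the empty string, so a global match always produces one final empty capture at the end of the content. The helper drops it; the listing does not, which is why each README ends with an empty ``| | '' line.

```perl
use strict;
use warnings;

# Hypothetical helper built around the listing's pattern
sub split_lines {
  my $content = shift;
  # Each match: a run of non-CR/LF characters, then an optional CR
  # and an optional LF, so CR, LF, and CRLF all end a line.
  my @lines = $content =~ /([^\cM\cJ]*)\cM?\cJ?/g;
  pop @lines if @lines && $lines[-1] eq "";  # empty match at end of string
  return @lines;
}

# DOS, Unix, and old Mac endings in one string
print "| | $_\n" for split_lines("dos\cM\cJunix\cJmac\cMlast");
```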

Next, we've got the get_module_data subroutine defined in lines 85 to 96. However, this subroutine performs a rather expensive and messy operation, so we'll cache the output in a static lexically local variable defined in line 83. The variable and the subroutine are enclosed in a BEGIN block to ensure proper closure between them.

The meat of the subroutine is the block between lines 87 and 94, which generates the value we desire cached. The ||= do start to the block ensures that we'll perform the block only when the current value of $data is false, such as when it is undef initially. If the value is not false, we simply return it.
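The caching idiom, reduced to a sketch with invented names (expensive_value and $calls are not from the listing):

```perl
use strict;
use warnings;

my $calls;                          # counts how often the slow path runs

BEGIN {
  my $cache;                        # private to the sub below

  sub expensive_value {
    $cache ||= do {
      $calls++;                     # stands in for the slow work
      my %result = (answer => 42);
      \%result;                     # a reference is always true
    };
  }
}

expensive_value() for 1 .. 3;
print "slow path ran $calls time(s)\n";
```

Because ||= tests for truth, a legitimately false result (0 or an empty string) would be recomputed on every call. That cannot happen when the cached value is a reference, as it is here and in the listing; on Perl 5.10 and later, //= tests for definedness instead.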

The block first computes the name of the ``modlist data'' file in line 87, again in a way that should be portable regardless of the filename syntax. The modlist data file contains Perl code to be executed. However, we don't want to execute arbitrary code from a website, lest our box be owned by some scriptkiddie who managed to replace the file with nefarious code. So we'll execute this code in a Safe environment. I'll admit that I stole most of the semantics of this code from the existing CPAN.pm module source, although I rewrote most of the syntax. The code to be executed comes from uncompressing the ``modlist data'' file in line 91. A call to CPAN::Modulelist->data returns the value from the code, which ends up in $ret declared in line 89. If anything goes wrong, we'll abort with an appropriate error message.
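A minimal sketch of the Safe technique, assuming a made-up stand-in for the uncompressed modlist body (the real file is much larger and differently structured inside):

```perl
use strict;
use warnings;
use Safe;

# Made-up stand-in for the body of the modlist data file
my $code = <<'END';
package CPAN::Modulelist;
sub data { return { 'Foo::Bar' => { description => 'An example module' } } }
END

# Compile and run the code inside a restricted compartment
my $cpt  = Safe->new("CPAN::Safe1");
my $data = $cpt->reval($code . "CPAN::Modulelist->data");
die "modlist evaluation failed: $@" if $@;
print $data->{'Foo::Bar'}{description}, "\n";

# Code that tries something outside the default permitted operations
# is refused at compile time, and the reason lands in $@.
$cpt->reval('system("echo owned")');
print "blocked: $@" if $@;
```

The compartment's default operator mask allows ordinary data construction and subroutine calls but not things like system or opening files, which is what makes running a downloaded data file tolerable.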

Lines 99 to 115 define the weirdly-named uncompress_and_grab_after_blank routine. Oddly enough, both the ``package details'' and the ``modlist data'' files are compressed, and have an unneeded header delimited from the body by a blank line, so we get to use this routine twice. The file name comes in from the first parameter (line 100). Line 101 defines the state flag (we're in the header initially, so all we're doing is looking for the blank line signifying the end of the header). Line 102 defines the return value (initially an empty list).

Line 104 opens up the compressed file, returning a gzip object handle.

Lines 106 to 113 read the file line-by-line until end-of-file. If we're in the header, we're looking for a blank line. If we're not in the header, then the line is attached to the end of the @return value. When we're done, the list of lines is returned in line 114.
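To try the header-skipping loop without a mirror, the sketch below first writes a tiny gzipped file with invented content and then reads it back. The name body_lines is hypothetical, and a lexical $line is used in place of $_:

```perl
use strict;
use warnings;
use Compress::Zlib qw(gzopen $gzerrno);
use File::Temp qw(tempfile);

# Hypothetical variant of the listing's routine
sub body_lines {
  my $file = shift;
  my $gz = gzopen($file, "rb") or die "Cannot open $file: $gzerrno";
  my ($in_header, @body) = (1);
  my $line;
  while ($gz->gzreadline($line) > 0) {
    if ($in_header) {
      $in_header = 0 unless $line =~ /\S/;  # a blank line ends the header
      next;                                 # header and blank line are dropped
    }
    push @body, $line;
  }
  $gz->gzclose;
  return @body;
}

# Build a small gzipped file (invented content) to exercise the routine
my ($fh, $tmp) = tempfile(SUFFIX => ".gz", UNLINK => 1);
close $fh;
my $out = gzopen($tmp, "wb") or die "Cannot write $tmp: $gzerrno";
$out->gzwrite("File: demo\nLines: 2\n\nfirst\nsecond\n");
$out->gzclose;

print body_lines($tmp);
```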

Lines 117 to 126 define the last subroutine, get_archive_tar_for. The incoming parameter is the filename to open, shifted off in line 118. Lines 119 to 123 create the Archive::Tar object. However, I found that the library sometimes spits raw compressed data to standard error, and that was pretty nasty on my terminal. To get around that, I create a local STDERR glob, then reopen the STDERR filehandle to a /dev/null-ish sort of thing, which took care of the junk on my screen from a bad open. Line 124 turns any die from inside the eval block into a mere warn. Line 125 returns the final Archive::Tar object, if any.
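The STDERR trick generalizes to any noisy call. A sketch, with quietly as a hypothetical helper name; it uses the three-argument form of open and checks the result, which the listing does not:

```perl
use strict;
use warnings;
use File::Spec::Functions qw(devnull);

# Hypothetical helper: run a code ref with Perl's STDERR handle pointed
# at the null device, turning any die into a warning afterwards.
sub quietly {
  my $code = shift;
  my $result = eval {
    local *STDERR;                   # original handle returns on block exit
    open STDERR, ">", devnull() or die "Cannot open null device: $!\n";
    $code->();
  };
  warn $@ if $@;                     # STDERR is back to normal by now
  return $result;
}

my $value = quietly(sub { print STDERR "noise\n"; 42 });
print "got $value\n";

my $none = quietly(sub { die "bad archive\n" });
print "failure yields undef\n" unless defined $none;
```

This silences only output written through Perl's STDERR filehandle. Anything that writes straight to file descriptor 2, such as an external program, would still reach the terminal.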

And there it is: my mini-CPAN mini-readme-browser. In playing with this for the past few weeks, I've already seen about a dozen new cool modules that I'm now investigating, so you can be sure that I'll be writing about them in future columns. So until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     ### CONFIG
        =6=     
        =7=     my $LOCAL = "/Users/merlyn/MIRROR/MINICPAN/";
        =8=     
        =9=     ### END CONFIG
        =10=    
        =11=    ## core -
        =12=    use File::Spec::Functions qw(catfile devnull);
        =13=    use Safe qw();
        =14=    
        =15=    ## Compress::Zlib -
        =16=    use Compress::Zlib qw(gzopen $gzerrno);
        =17=    
        =18=    ## Archive::Tar -
        =19=    use Archive::Tar qw();
        =20=    
        =21=    my $days_ago = 0;
        =22=    for my $distro (get_distro_sorted_by_age()) { # list of hashrefs
        =23=    
        =24=      ## paging by days old
        =25=      unless ((time - $distro->{modtime})/86400 < $days_ago + 1) {
        =26=        print "[more]\n";
        =27=        <STDIN>;
        =28=        print ++$days_ago, " days ago:\n";
        =29=      }
        =30=    
        =31=      show_distro($distro);
        =32=    }
        =33=    
        =34=    exit 0;
        =35=    
        =36=    sub get_distro_sorted_by_age {
        =37=      my %seen;
        =38=      my @distros;
        =39=    
        =40=      my $details = catfile($LOCAL, qw(modules 02packages.details.txt.gz));
        =41=      for (uncompress_and_grab_after_blank($details)) {
        =42=        my ($module, $version, $path) = split;
        =43=        next if $path =~ m{/perl-5}; # skip Perl distributions
        =44=    
        =45=        next if $seen{$path}++;
        =46=    
        =47=        ## native absolute file:
        =48=        my $local_file = catfile($LOCAL, split "/", "authors/id/$path");
        =49=    
        =50=        push @distros, {
        =51=                        filename => $local_file,
        =52=                        path => $path,
        =53=                        module => $module,
        =54=                        modtime => (stat($local_file))[9],
        =55=                       };
        =56=      }
        =57=      ## return distros sorted by descending modtimes
        =58=      sort {$b->{modtime} <=> $a->{modtime}} @distros;
        =59=    }
        =60=    
        =61=    sub show_distro {
        =62=      my $distro = shift;
        =63=    
        =64=      my $data = get_module_data();
        =65=      my $at = get_archive_tar_for($distro->{filename}) or return;
        =66=    
        =67=      my $description = $data->{$distro->{module}}{description} || "";
        =68=      print "$distro->{path} ($description)\n";
        =69=    
        =70=      my @readmes = sort grep m{/README\z}, $at->list_files();
        =71=    
        =72=      for my $readme (@readmes) {
        =73=        print "| $readme\n";
        =74=        my $content = $at->get_content($readme);
        =75=        for ($content =~ /([^\cM\cJ]*)\cM?\cJ?/g) {
        =76=          print "| | $_\n";
        =77=        }
        =78=      }
        =79=    
        =80=    }
        =81=    
        =82=    BEGIN {
        =83=      my $data;                     # cached value
        =84=    
        =85=      sub get_module_data {
        =86=        $data ||= do {
        =87=          my $modlist = catfile($LOCAL, qw(modules 03modlist.data.gz));
        =88=          no strict;
        =89=          my $ret = Safe->new("CPAN::Safe1")->
        =90=            reval(join("",
        =91=                       uncompress_and_grab_after_blank($modlist),
        =92=                       "CPAN::Modulelist->data"));
        =93=          die $@ if $@;
        =94=          $ret;
        =95=        };
        =96=      }
        =97=    }
        =98=    
        =99=    sub uncompress_and_grab_after_blank {
        =100=     my $file = shift;
        =101=     my $inheader = 1;
        =102=     my @return = ();
        =103=   
        =104=     my $gz = gzopen($file, "rb") or die "Cannot open $file: $gzerrno";
        =105=   
        =106=     while ($gz->gzreadline($_) > 0) {
        =107=       if ($inheader) {
        =108=         $inheader = 0 unless /\S/;
        =109=         next;
        =110=       }
        =111=   
        =112=       push @return, $_;
        =113=     }
        =114=     @return;
        =115=   }
        =116=   
        =117=   sub get_archive_tar_for {
        =118=     my $filename = shift;
        =119=     my $at = eval {
        =120=       local *STDERR;
        =121=       open STDERR, ">".devnull();
        =122=       Archive::Tar->new($filename) or die "Archive::Tar failed on $filename\n";
        =123=     };
        =124=     warn $@ if $@;
        =125=     $at;
        =126=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.