Copyright Notice

This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Linux Magazine Column 42 (Nov 2002)

[suggested title: Mirroring your own mini-CPAN]

The Comprehensive Perl Archive Network, known as ``the CPAN'', is the ``one stop shopping center'' for all things Perl. This 1.2 GB archive contains over 13000 modules for inclusion in your programs, as well as scripts, documentation, many non-Unix Perl binaries, and other interesting things.

Although there's nearly always a good fast CPAN archive nearby when you are connected to the net, sometimes you're connected to the net at different speeds (like quickly at work, but slowly at home or vice versa), or not at all. And what do you do then when you're like me, at 30,000 feet jetting off to yet another conference or customer site, and you realize you need a module that you haven't yet installed on your laptop? (This is especially an issue when a deadline for a magazine column looms close.)

Well, for the past year or so, I've been mirroring the entire CPAN to my laptop, thanks to the permission and cooperation of the owner of one of the major archive sites (and a few carefully constructed rsync commands). But at a recent conference, someone said ``hey, can you just burn that onto a CD for me?'', and I was stuck. The current CPAN exceeds the size of a CD-ROM, even though only a small portion of the files are needed for module installation!

So that got me thinking. If I brought down only the files that were needed by CPAN.pm to perform the installation of the latest release of a module, how big would that be? And the answer was wonderfully surprising: a bit more than 200 meg, which easily fits on a CD-ROM.

Unfortunately, I didn't see any clean, easy-to-use, efficient ``mirror only the latest modules of the CPAN'' program out there, so I wrote my own, which I present in [listing one, below].

Lines 1 through 3 start nearly every long program that I write, enabling warnings, compiler restrictions, and disabling buffering on STDOUT.

Lines 5 through 17 form the configuration section of this program. There are really only three things to set here.

$REMOTE is the URL prefix leading to the nearest CPAN archive. The uncommented value is the main United States CPAN archive. The next value is the Finland archive, which also happens to be the master archive. If you want the most up-to-date sources, they're here. And because I was initially developing this program at the annual SAGE-AU conference in Australia, the value following that is the Australian CPAN archive. Finally, I have a complete CPAN archive on my laptop's disk already, so I can point to that with a file: URL as well, as shown by the fourth value.

That's the source, and we need to define a destination, and that's in $LOCAL. This is a simple Unix path. If you're on a non-Unix system, you can specify this in the local directory syntax, since we'll be using the cross-platform File::Spec library to manipulate this path. And, as the comment warns, this program owns the contents of that directory, and is free to delete anything it sees fit, so keep that in mind as you are specifying the path.

Finally, a simple true/false $TRACE flag decides whether this program is noisy by default or quiet by default. The noise is limited to actual activity, and reassures me during execution that something is happening.

Next, from lines 20 to 30, we'll pull in the necessary modules. The standard Perl distribution gives us the mkpath, dirname, catfile, and find routines. The optional CPAN-installable LWP library gives us the URI object module and the mirror routine (and some associated status values). And Compress::Zlib lets us expand the gzip-compressed index file so we know what distributions are needed for the mirror.

Once we've got everything set up, it's time to transfer everything needed for a typical operation of the core CPAN module (described by perldoc CPAN in a typical Perl installation). First, we need the index files, listed in lines 34 to 36. We'll call my_mirror (defined later) on each of those. For now, we'll presume that this creates or refreshes each of those files below the $LOCAL-identified directory.

The 02packages.details.txt.gz file is a flat text file with a short header that contains the path to each distribution for each module in the CPAN. However, this file is gzip-compressed, so we need to expand the file to process the contents. Stealing the example out of the Compress::Zlib manpage nearly directly, lines 40 to 52 expand this file and extract the necessary information.

Line 40 constructs the filename in a platform-independent way by using the catfile routine. Note that we're actually passing three parameters. The first parameter is the value of $LOCAL, which serves as the starting point, from which we descend further to the subdirectory called modules, and thence finally to a file within that directory called 02packages.details.txt.gz. I've tested this only on Unix, but I'll presume that the program is portable, because I've used the portable functions.
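As a small illustration of the three-parameter call, here is a sketch using a hypothetical base directory (/tmp/minicpan is my assumption for the example, not the value in the listing). The path in the comment assumes Unix path rules; other platforms will produce their own native form.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(catfile);

# Hypothetical base directory, standing in for $LOCAL.
my $local = "/tmp/minicpan";

# Base directory, then a subdirectory, then the file within it.
my $details = catfile($local, qw(modules 02packages.details.txt.gz));

print "$details\n";  # on Unix: /tmp/minicpan/modules/02packages.details.txt.gz
```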

Line 41 takes this constructed path, and creates a Compress::Zlib object, which can be asked to deliver the uncompressed file line-by-line. If that fails, we're in an unrecoverable state, and we'll abort.
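To see the line-by-line reading in isolation, the following sketch writes a tiny gzip file into a scratch directory and reads it back with the same gzopen and gzreadline calls. The file name and contents are made up for the example.

```perl
use strict;
use warnings;
use Compress::Zlib qw(gzopen $gzerrno);
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

# Write a tiny gzip file into a scratch directory (hypothetical sample data).
my $dir  = tempdir(CLEANUP => 1);
my $file = catfile($dir, "sample.txt.gz");

my $out = gzopen($file, "wb") or die "Cannot write $file: $gzerrno";
$out->gzwrite("first line\nsecond line\n");
$out->gzclose;

# Read it back a line at a time, as the listing does with the index.
my $in = gzopen($file, "rb") or die "Cannot open $file: $gzerrno";
my @lines;
my $line;
while ($in->gzreadline($line) > 0) {
  push @lines, $line;
}
$in->gzclose;

print @lines;
```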

The data contains a header, delimited by a blank line, so we need to skip over all the data up to and including that blank line. We'll do this by setting a flag to an initial 1 value in line 42. Line 43 reads a line at a time into $_, stopping when there is no more data (or there's an I/O error). Lines 44 to 47 look for the end of the header as long as we're still in the header. A header ends on a line that doesn't contain a non-blank character, hence the unless.

If we make it to line 49, we're staring at a standard line from the index, which looks something like

  Parse::RecDescent 1.80 D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz

The first column is the module name (here Parse::RecDescent), and is not very interesting to us. Neither is the second column, which is the current version number. But the third column contains (the unique part of) the path to the distribution for this module, and that's what the CPAN module will be looking for, and what we need to mirror.
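The header skip and the column split can be tried without any network access. This sketch reads from an in-memory string rather than the real gzipped index; the two header lines are an abbreviated stand-in I made up, and only the entry line comes from the example above.

```perl
use strict;
use warnings;

# Hypothetical, abbreviated sample of the index: header, blank line, entries.
my $sample = <<'END';
File: 02packages.details.txt
Description: sample header for illustration only

Parse::RecDescent 1.80 D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz
END

open my $fh, "<", \$sample or die "Cannot read sample: $!";

my @paths;
my $inheader = 1;
while (<$fh>) {
  if ($inheader) {
    $inheader = 0 unless /\S/;  # the blank line ends the header
    next;
  }
  my ($module, $version, $path) = split;
  push @paths, $path;           # only the third column matters here
}

print "$_\n" for @paths;
```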

Note that many module names will share the same common distribution file, so we'll need logic to avoid downloading duplicates. We'll defer that problem to the my_mirror subroutine.

A few of the modules are listed as belonging to a core Perl distribution. To avoid mirroring the various Perl distributions (and wasting space in our mirror), we'll skip over them in line 50. The regular expression is somewhat ad-hoc, but seems to do the right thing.
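Here is a quick way to see what that pattern keeps and what it drops. The first path is the Parse::RecDescent entry shown earlier; the second is a hypothetical path invented to look like a core Perl distribution.

```perl
use strict;
use warnings;

# The second path is hypothetical, for illustration only.
my @paths = (
  "D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz",
  "X/XY/XYZZY/perl-5.8.0.tar.gz",
);

# Keep everything that does not look like a Perl distribution.
my @wanted = grep { !m{/perl-5} } @paths;

print "$_\n" for @wanted;
```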

Line 51 mirrors the requested distribution into our local mirror. The 1 parameter says ``if it already exists, it's up to date'', and is an optimization based on external knowledge that a given distribution will never be updated in place. Rather, a new file will be created with a new version number. Of course, like any optimization, we do this with some hesitation and a bit of caution.

Once we've passed through the entire module list, we need to delete any outdated modules. A CPAN contributor has the option of leaving older versions of modules in the CPAN, or deleting them. We need to keep track of everything that is current, and delete anything not mentioned, in order to keep in sync with the master archive. That's the job of the clean_unmirrored call in line 55.

And that's it, as line 57 confirms.

But of course, that's not the whole story. We need to manage the mirroring. There are two steps to mirroring: fetching the files, and throwing away anything left over. These need to share a common hash, which we'll define as a closure variable inside a BEGIN block starting in line 59. The %mirrored hash in line 62 is keyed by the filename, and has a value of 1 to indicate that the file has been at least checked for existence, and 2 to indicate that it has been mirrored from the remote site and brought up to date. At the end of the run, any file without a value of 1 or 2 is either a distribution that has dropped out of the index or a leftover temp file, and should be deleted from our mirror.
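The closure-variable idiom is worth seeing on its own. In this sketch, the three subroutine names are hypothetical helpers of mine, not part of the listing; they share one private hash in the same way that my_mirror and clean_unmirrored share %mirrored.

```perl
use strict;
use warnings;

BEGIN {
  my %state;  # visible only to the subs defined in this block

  # Record an existence check, without downgrading a full mirror.
  sub note_checked  { my $file = shift; $state{$file} ||= 1 }

  # Record a full mirror.
  sub note_mirrored { my $file = shift; $state{$file} = 2 }

  # 0 means "never seen", which would mark the file for cleanup.
  sub state_of      { my $file = shift; $state{$file} || 0 }
}

note_mirrored("a.tar.gz");
note_checked("a.tar.gz");    # stays at 2
note_checked("CHECKSUMS");   # becomes 1

print join(" ", map { state_of($_) } qw(a.tar.gz CHECKSUMS stray.tmp)), "\n";
# prints "2 1 0"
```

Wrapping the hash in a BEGIN block means it is initialized at compile time, so the subroutines can be called from code that appears earlier in the file, as the listing does.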

The my_mirror routine starting in line 64 does the hard work. The two parameters are the partial URL path and the ``skip if present'' flag.

In line 68, we use the URI module to construct the full URL, based on the $REMOTE value and the partial path. Line 69 constructs the local file path, based on $LOCAL and the partial path as well. The task for the remainder of the subroutine is to make the local file be up to date with respect to the remote URL.
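The two constructions can be tried side by side. In this sketch the remote prefix is the one from the listing, while /tmp/minicpan is a hypothetical destination chosen for the example; the local path printed will be in whatever form your platform uses.

```perl
use strict;
use warnings;
use URI ();
use File::Spec::Functions qw(catfile);

my $remote = "http://www.cpan.org/";  # as in the listing
my $local  = "/tmp/minicpan";         # hypothetical destination
my $path   = "authors/id/D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz";

# URL arithmetic for the remote side...
my $remote_uri = URI->new_abs($path, $remote)->as_string;

# ...and native path arithmetic for the local side.
my $local_file = catfile($local, split "/", $path);

print "$remote_uri\n";
print "$local_file\n";
```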

Line 70 manages the checksum file. Each distribution is checksummed to ensure proper complete transfer. We'll first pretend that the checksum file doesn't need updating, but later remove that assumption if we end up transferring the distribution file.

Starting in line 72, we look at what to do to bring this file up to date. If $skip_if_present is true, then we'll never worry about the remote timestamp being out of sync. If the file is present, it's good enough, as determined by the -f file test in line 72. Line 74 records that the file was at least checked for existence, so we don't delete it during the cleanup phase.

If $skip_if_present is not true, or the file doesn't exist, then it's time to do a full mirror on this distribution, provided we haven't already done one during this run (the test in line 75). We'll note that in line 77. Line 79 creates the directory to receive the file. (I would argue that LWP should do this for me, but that's not the way it works.) The $TRACE value causes a series of mkdir command-lines to be traced to the output; otherwise, this operation is silent. Line 80 also puts out some noise if $TRACE is set: note the absence of a newline, because we're going to follow on with a result status.
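The directory creation step can be exercised safely in a scratch area. This sketch uses a temporary directory and a made-up file name; the second argument to mkpath is hard-wired to true here so that the trace output is visible.

```perl
use strict;
use warnings;
use File::Path qw(mkpath);
use File::Basename qw(dirname);
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

my $base = tempdir(CLEANUP => 1);   # scratch area for the example
my $file = catfile($base, qw(authors id D DC DCONWAY example.tar.gz));

# Second argument true: print each directory as it is created.
mkpath(dirname($file), 1, 0711);

print -d dirname($file) ? "directory ready\n" : "directory missing\n";
```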

Line 81 is where the real work happens. We'll call mirror to bring the remote URL to the local file. This is done in such a way that the existing modification timestamp (if any) is noted and respected, minimizing the load on the remote server. And the file is actually written into a temp file, and then renamed only when the transfer is complete, thus ensuring that other users of this directory will not see partially transferred files at normal locations. (If one of these transfers aborts mid-way, the cleanup phase at the end of this program will delete the partial transfer.) The modification time is also updated to that of the remote data, so that a later mirror will again note that the file is up to date.

The result of mirror is an HTTP status value. If it's RC_OK, then we've got a new version of the remote file. In this case, the checksum file may now be out of date: we can't merely check for its existence, so we'll flag that by setting the variable to 0 in line 84.

If the response is RC_NOT_MODIFIED, then we already had an up-to-date version of the file, and the remote server has informed us of such without even sending us a new version. In that case, we end up in line 90, finishing out the tracing message if needed.

However, if the status is neither of these, then something has gone wrong, and we'll issue a warning noting the status, and abort any further operation on this path by returning from the subroutine.
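The three-way decision on the status can be sketched without touching the network. The classify_status helper below is hypothetical (the listing does this inline); it relies only on the RC_OK and RC_NOT_MODIFIED constants that the listing already imports from LWP::Simple.

```perl
use strict;
use warnings;
use LWP::Simple qw(RC_OK RC_NOT_MODIFIED);

# Hypothetical helper: map a mirror() status to the action taken.
sub classify_status {
  my $status = shift;
  return "updated"    if $status == RC_OK;            # new copy fetched
  return "up to date" if $status == RC_NOT_MODIFIED;  # nothing transferred
  return "failed";                                    # anything else
}

print classify_status($_), "\n" for 200, 304, 404;
```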

Once the distribution has been transferred, it's time to grab the checksum file. If the path is a distribution (checked in line 94), we'll compute the path to the CHECKSUMS file in lines 95 and 96. We must be careful to perform URL calculations here, not native path calculations. And, to keep the algorithm easy, we need to compute the path relative to the original CPAN mirror base, not a full path. Thankfully, this is also trivial with the URI module.
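To make the URL calculation concrete, this sketch repeats the two steps for the Parse::RecDescent distribution used earlier: resolve CHECKSUMS against the full remote URL, then make the result relative to the archive base again.

```perl
use strict;
use warnings;
use URI ();

my $remote = "http://www.cpan.org/";  # as in the listing
my $path   = "authors/id/D/DC/DCONWAY/Parse-RecDescent-1.80.tar.gz";

my $remote_uri = URI->new_abs($path, $remote)->as_string;

# CHECKSUMS lives beside the distribution; express it relative to the base.
my $checksum_path = URI->new_abs("CHECKSUMS", $remote_uri)->rel($remote);

print "$checksum_path\n";  # authors/id/D/DC/DCONWAY/CHECKSUMS
```

The result is a URI object rather than a plain string, but it stringifies when compared or interpolated, which is all the listing needs.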

In line 97, if we're not already looking at a CHECKSUMS file, we need to call back to ourself to transfer the file. This is a clean tail-recursion, so I could have simply used a goto or a loop, but the subroutine call seemed easier and clearer at the time. If the checksum might already be up to date, it will merely be checked for its presence. If a transfer has taken place, a full mirror call will be issued instead.

Finally, we have the cleanup phase routine. We'll start at $LOCAL using the File::Find recursion. If a file exists, and it's not noted as such in the %mirrored hash (line 105), then we remove it (line 107).
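The cleanup pass can be rehearsed in a scratch directory before you trust it with real files. In this sketch both file names are made up, and the hash is a plain lexical rather than the closure variable in the listing. It assumes Unix-style paths, so that the name built by catfile matches what File::Find reports.

```perl
use strict;
use warnings;
use File::Find qw(find);
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

my $dir = tempdir(CLEANUP => 1);

# Create one file we "mirrored" and one stray file (both hypothetical).
my $kept  = catfile($dir, "kept.tar.gz");
my $stray = catfile($dir, "stray.tmp");
for my $file ($kept, $stray) {
  open my $fh, ">", $file or die "Cannot create $file: $!";
  close $fh;
}

my %mirrored = ($kept => 2);

# Remove every plain file that the hash does not mention.
find sub {
  return unless -f and not $mirrored{$File::Find::name};
  unlink $_ or warn "Cannot remove $File::Find::name: $!";
}, $dir;

print -e $kept  ? "kept remains\n"  : "kept removed\n";
print -e $stray ? "stray remains\n" : "stray removed\n";
```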

And there you have it. Set up the configuration, and let it rip. On the first execution, you will want to be on a fast link (or a relatively unloaded time of day), because it downloads about 200 megabytes of data. After that, it's about 2-5 minutes per (average) day on a 28.8 link, which is completely tolerable for me from my hotel room when I'm on the road. And don't forget: you're downloading only installable modules, not the rest of the CPAN.

To use this mini-CPAN mirror with CPAN.pm, you'll need to enter at the CPAN prompt:

        o conf urllist unshift file://$LOCAL
        o conf commit
        reload index

Here, $LOCAL is replaced by the value you've set in $LOCAL but specified as a URL path (forward slashes for directory delimiters, and percent-escaped unusual characters). That's because CPAN.pm is expecting a URL, not a file path.
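If you'd rather not build that URL by hand, the URI::file class that ships with the URI module can do the conversion. This is a sketch of one way to do it, not something the listing itself does; the "unix" argument forces Unix path rules so the result doesn't depend on where you run it.

```perl
use strict;
use warnings;
use URI::file ();

my $local = "/Users/merlyn/MIRROR/MINICPAN/";  # $LOCAL from the listing

# Force Unix path rules so the result does not depend on the current platform.
my $url = URI::file->new($local, "unix");

print "$url\n";  # file:///Users/merlyn/MIRROR/MINICPAN/
```

The printed value is what you would give to the urllist setting.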

At the risk of repeating myself: this won't make CPAN installations any faster, unless you happen to be a road-warrior like me, needing to do CPAN installations when you are on a very slow net link (or no link at all). Of course, you could burn a daily CD for your friends, and ``hand them a CPAN archive on a disk'', providing a gateway between your bandwidth and the sneakernet. At least you won't have to worry about how to fit the full 1.2+ GB CPAN on a CD-ROM! Until next time, enjoy!

Listing

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     ### CONFIG
        =6=     
        =7=     my $REMOTE = "http://www.cpan.org/";
        =8=     # my $REMOTE = "http://fi.cpan.org/";
        =9=     # my $REMOTE = "http://au.cpan.org/";
        =10=    # my $REMOTE = "file:///Users/merlyn/MIRROR/CPAN/";
        =11=    
        =12=    ## warning: unknown files below this dir are deleted!
        =13=    my $LOCAL = "/Users/merlyn/MIRROR/MINICPAN/";
        =14=    
        =15=    my $TRACE = 1;
        =16=    
        =17=    ### END CONFIG
        =18=    
        =19=    ## core -
        =20=    use File::Path qw(mkpath);
        =21=    use File::Basename qw(dirname);
        =22=    use File::Spec::Functions qw(catfile);
        =23=    use File::Find qw(find);
        =24=    
        =25=    ## LWP -
        =26=    use URI ();
        =27=    use LWP::Simple qw(mirror RC_OK RC_NOT_MODIFIED);
        =28=    
        =29=    ## Compress::Zlib -
        =30=    use Compress::Zlib qw(gzopen $gzerrno);
        =31=    
        =32=    ## first, get index files
        =33=    my_mirror($_) for qw(
        =34=                         authors/01mailrc.txt.gz
        =35=                         modules/02packages.details.txt.gz
        =36=                         modules/03modlist.data.gz
        =37=                        );
        =38=    
        =39=    ## now walk the packages list
        =40=    my $details = catfile($LOCAL, qw(modules 02packages.details.txt.gz));
        =41=    my $gz = gzopen($details, "rb") or die "Cannot open details: $gzerrno";
        =42=    my $inheader = 1;
        =43=    while ($gz->gzreadline($_) > 0) {
        =44=      if ($inheader) {
        =45=        $inheader = 0 unless /\S/;
        =46=        next;
        =47=      }
        =48=    
        =49=      my ($module, $version, $path) = split;
        =50=      next if $path =~ m{/perl-5};  # skip Perl distributions
        =51=      my_mirror("authors/id/$path", 1);
        =52=    }
        =53=    
        =54=    ## finally, clean the files we didn't stick there
        =55=    clean_unmirrored();
        =56=    
        =57=    exit 0;
        =58=    
        =59=    BEGIN {
        =60=      ## %mirrored tracks the already done, keyed by filename
        =61=      ## 1 = local-checked, 2 = remote-mirrored
        =62=      my %mirrored;
        =63=    
        =64=      sub my_mirror {
        =65=        my $path = shift;           # partial URL
        =66=        my $skip_if_present = shift; # true/false
        =67=    
        =68=        my $remote_uri = URI->new_abs($path, $REMOTE)->as_string; # full URL
        =69=        my $local_file = catfile($LOCAL, split "/", $path); # native absolute file
        =70=        my $checksum_might_be_up_to_date = 1;
        =71=    
        =72=        if ($skip_if_present and -f $local_file) {
        =73=          ## upgrade to checked if not already
        =74=          $mirrored{$local_file} = 1 unless $mirrored{$local_file};
        =75=        } elsif (($mirrored{$local_file} || 0) < 2) {
        =76=          ## upgrade to full mirror
        =77=          $mirrored{$local_file} = 2;
        =78=    
        =79=          mkpath(dirname($local_file), $TRACE, 0711);
        =80=          print $path if $TRACE;
        =81=          my $status = mirror($remote_uri, $local_file);
        =82=    
        =83=          if ($status == RC_OK) {
        =84=            $checksum_might_be_up_to_date = 0;
        =85=            print " ... updated\n" if $TRACE;
        =86=          } elsif ($status != RC_NOT_MODIFIED) {
        =87=            warn "\n$remote_uri: $status\n";
        =88=            return;
        =89=          } else {
        =90=            print " ... up to date\n" if $TRACE;
        =91=          }
        =92=        }
        =93=    
        =94=        if ($path =~ m{^authors/id}) { # maybe fetch CHECKSUMS
        =95=          my $checksum_path =
        =96=            URI->new_abs("CHECKSUMS", $remote_uri)->rel($REMOTE);
        =97=          if ($path ne $checksum_path) {
        =98=            my_mirror($checksum_path, $checksum_might_be_up_to_date);
        =99=          }
        =100=       }
        =101=     }
        =102=   
        =103=     sub clean_unmirrored {
        =104=       find sub {
        =105=         return unless -f and not $mirrored{$File::Find::name};
        =106=         print "$File::Find::name ... removed\n" if $TRACE;
        =107=         unlink $_ or warn "Cannot remove $File::Find::name: $!";
        =108=       }, $LOCAL;
        =109=     }
        =110=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.